ALTIVECPEM/D
2/2002
Rev. 2.0

AltiVec Technology Programming Environments Manual

Freescale Semiconductor, Inc.
For more information on this product, go to: www.freescale.com
AltiVec is a trademark of Motorola, Inc. DigitalDNA is a trademark of Motorola, Inc. The PowerPC name and the PowerPC logotype are trademarks of International Business Machines Corporation used by Motorola under license from International Business Machines Corporation.

This document contains information on a new product under development. Motorola reserves the right to change or discontinue this product without notice. Information in this document is provided solely to enable system and software implementers to use PowerPC microprocessors. There are no express or implied copyright licenses granted hereunder to design or fabricate PowerPC integrated circuits or integrated circuits based on the information in this document.

Motorola reserves the right to make changes without further notice to any products herein. Motorola makes no warranty, representation or guarantee regarding the suitability of its products for any particular purpose, nor does Motorola assume any liability arising out of the application or use of any product or circuit, and specifically disclaims any and all liability, including without limitation consequential or incidental damages. "Typical" parameters can and do vary in different applications. All operating parameters, including "Typicals" must be validated for each customer application by customer's technical experts. Motorola does not convey any license under its patent rights nor the rights of others. Motorola products are not designed, intended, or authorized for use as components in systems intended for surgical implant into the body, or other applications intended to support or sustain life, or for any other application in which the failure of the Motorola product could create a situation where personal injury or death may occur. Should Buyer purchase or use Motorola products for any such unintended or unauthorized application, Buyer shall indemnify and hold Motorola and its officers, employees, subsidiaries, affiliates, and distributors harmless against all claims, costs, damages, and expenses, and reasonable attorney fees arising out of, directly or indirectly, any claim of personal injury or death associated with such unintended or unauthorized use, even if such claim alleges that Motorola was negligent regarding the design or manufacture of the part.

Motorola and the Motorola logo are registered trademarks of Motorola, Inc. Motorola, Inc. is an Equal Opportunity/Affirmative Action Employer.

Motorola Literature Distribution Centers:
USA/Europe: Motorola Literature Distribution; P.O. Box 5405; Denver, Colorado 80217; Tel.: 1-800-441-2447 or 1-303-675-2140
Japan: Nippon Motorola Ltd SPD, Strategic Planning Office, 4-32-1, Nishi-Gotanda, Shinagawa-ku, Tokyo 141, Japan; Tel.: 81-3-5487-8488
Asia/Pacific: Motorola Semiconductors H.K. Ltd.; 8B Tai Ping Industrial Park, 51 Ting Kok Road, Tai Po, N.T., Hong Kong; Tel.: 852-26629298

World Wide Web address: http://sps.motorola.com/mfax
Internet: http://motorola.com/sps
Technical information: Motorola Inc. SPS Customer Support Center, 1-800-521-6274; electronic mail address: crc@wmkmail.sps.mot.com
Document comments: FAX (512) 933-2625, Attn: RISC Applications Engineering
World Wide Web addresses: http://www.mot.com/powerpc, http://www.mot.com/netcomm

(c) Motorola Inc. 2001. All rights reserved.
1 Overview
2 AltiVec Register Set
3 Operand Conventions
4 Addressing Modes and Instruction Set Summary
5 Cache, Exceptions, and Memory Management
6 AltiVec Instructions
GLO Glossary of Terms and Abbreviations
IND Index
A Appendix A: Instruction Set Mnemonics - Decimal
B Appendix B: Instruction Set Mnemonics - Binary
C Appendix C: Opcodes - Decimal
D Appendix D: Opcodes - Binary
E Appendix E: Forms
F Appendix F: Legends
G Appendix G: Revision History
Contents

Paragraph Number  Title

Audience
Organization
Suggested Reading
General Information
Related Documentation
Conventions
Acronyms and Abbreviations
Terminology Conventions

Chapter 1 Overview
1.1 Overview
1.2 AltiVec Technology Overview
1.2.1 Levels of AltiVec ISA
1.2.2 Features Not Defined by AltiVec ISA
1.3 AltiVec Architectural Model
1.3.1 AltiVec Registers and Programming Model
1.3.2 Operand Conventions
1.3.2.1 Byte Ordering
1.3.2.2 Floating-Point Conventions
1.3.3 AltiVec Addressing Modes
1.3.4 AltiVec Instruction Set
1.3.5 AltiVec Cache Model
1.3.6 AltiVec Exception Model
1.3.7 Memory Management Model

Chapter 2 AltiVec Register Set
2.1 Overview on the AltiVec and PowerPC Registers
2.2 AltiVec Register Set Overview
2.3 Registers Defined by AltiVec ISA
2.3.1 AltiVec Vector Register File (VRF)
2.3.2 Vector Status and Control Register (VSCR)
2.3.3 Vector Save/Restore Register (VRSAVE)
2.4 Additions to PowerPC UISA Registers
2.4.1 PowerPC Condition Register
2.5 Additions to PowerPC OEA Registers
2.5.1 AltiVec Field Added in the PowerPC Machine State Register (MSR)
2.5.2 Machine Status Save/Restore Registers (SRRs)
2.5.2.1 Machine Status Save/Restore Register 0 (SRR0)
2.5.2.2 Machine Status Save/Restore Register 1 (SRR1)

Chapter 3 Operand Conventions
3.1 Data Organization in Memory
3.1.1 Aligned and Misaligned Accesses
3.1.2 AltiVec Byte Ordering
3.1.2.1 Big-Endian Byte Ordering
3.1.2.2 Little-Endian Byte Ordering
3.1.3 Quad Word Byte Ordering Example
3.1.4 Aligned Scalars in Little-Endian Mode
3.1.5 Vector Register and Memory Access Alignment
3.1.6 Quad-Word Data Alignment
3.1.6.1 Accessing a Misaligned Quad Word in Big-Endian Mode
3.1.6.2 Accessing a Misaligned Quad Word in Little-Endian Mode
3.1.6.3 Scalar Loads and Stores
3.1.6.4 Misaligned Scalar Loads and Stores
3.1.7 Mixed-Endian Systems
3.2 AltiVec Floating-Point Instructions - UISA
3.2.1 Floating-Point Modes
3.2.1.1 Java Mode
3.2.1.2 Non-Java Mode
3.2.2 Floating-Point Infinities
3.2.3 Floating-Point Rounding
3.2.4 Floating-Point Exceptions
3.2.4.1 NaN Operand Exception
3.2.4.2 Invalid Operation Exception
3.2.4.3 Zero Divide Exception
3.2.4.4 Log of Zero Exception
3.2.4.5 Overflow Exception
3.2.4.6 Underflow Exception
3.2.5 Floating-Point NaNs
3.2.5.1 NaN Precedence
3.2.5.2 SNaN Arithmetic
3.2.5.3 QNaN Arithmetic
3.2.5.4 NaN Conversion to Integer
3.2.5.5 NaN Production

Chapter 4 Addressing Modes and Instruction Set Summary
4.1 Conventions
4.1.1 Execution Model
4.1.2 Computation Modes
4.1.3 Classes of Instructions
4.1.4 Memory Addressing
4.1.4.1 Memory Operands
4.1.4.2 Effective Address Calculation
4.2 AltiVec UISA Instructions
4.2.1 Vector Integer Instructions
4.2.1.1 Saturation Detection
4.2.1.2 Vector Integer Arithmetic Instructions
4.2.1.3 Vector Integer Compare Instructions
4.2.1.4 Vector Integer Logical Instructions
4.2.1.5 Vector Integer Rotate and Shift Instructions
4.2.2 Vector Floating-Point Instructions
4.2.2.1 Floating-Point Division and Square-Root
4.2.2.1.1 Floating-Point Division
4.2.2.1.2 Floating-Point Square-Root
4.2.2.2 Floating-Point Arithmetic Instructions
4.2.2.3 Floating-Point Multiply-Add Instructions
4.2.2.4 Floating-Point Rounding and Conversion Instructions
4.2.2.5 Floating-Point Compare Instructions
4.2.2.6 Floating-Point Estimate Instructions
4.2.3 Load and Store Instructions
4.2.3.1 Alignment
4.2.3.2 Load and Store Address Generation
4.2.3.3 Vector Load Instructions
4.2.3.4 Vector Store Instructions
4.2.4 Control Flow
4.2.5 Vector Permutation and Formatting Instructions
4.2.5.1 Vector Pack Instructions
4.2.5.2 Vector Unpack Instructions
4.2.5.3 Vector Merge Instructions
4.2.5.4 Vector Splat Instructions
4.2.5.5 Vector Permute Instruction
4.2.5.6 Vector Select Instruction
4.2.5.7 Vector Shift Instructions
4.2.5.7.1 Immediate Interelement Shifts/Rotates
4.2.5.7.2 Computed Interelement Shifts/Rotates
4.2.5.7.3 Variable Interelement Shifts
4.2.6 Processor Control Instructions - UISA
4.2.6.1 AltiVec Status and Control Register Instructions
4.2.7 Recommended Simplified Mnemonics
4.3 AltiVec VEA Instructions
4.3.1 Memory Control Instructions - VEA
4.3.2 User-Level Cache Instructions - VEA

Chapter 5 Cache, Exceptions, and Memory Management
5.1 PowerPC Shared Memory
5.2 AltiVec Memory Bandwidth Management
5.2.1 Software-Directed Prefetch
5.2.1.1 Data Stream Touch (dst)
5.2.1.2 Transient Streams
5.2.1.3 Storing to Streams (dstst)
5.2.1.4 Stopping Streams
5.2.1.5 Exception Behavior of Prefetch Streams
5.2.1.6 Synchronization Behavior of Streams
5.2.1.7 Address Translation for Streams
5.2.1.8 Stream Usage Notes
5.2.1.9 Stream Implementation Assumptions
5.2.2 Prioritizing Cache Block Replacement
5.2.3 Partially Executed AltiVec Instructions
5.3 DSI Exception - Data Address Breakpoint
5.4 AltiVec Unavailable Exception (0x00F20)

Chapter 6 AltiVec Instructions
6.1 Instruction Formats
6.1.1 Instruction Fields
6.1.2 Notation and Conventions
6.2 AltiVec Instruction Set

Appendix A AltiVec Instruction Set Listings
A.1 Instructions Sorted by Mnemonic in Decimal Format

Appendix B Instructions Sorted by Mnemonic in Binary Format
B.1 Instructions Sorted by Mnemonic in Binary Format

Appendix C Instructions Sorted by Opcode
C.1 Instructions Sorted by Opcode in Decimal Format

Appendix D Instructions Sorted by Opcode
D.1 Instructions Sorted by Opcode in Binary Format

Appendix E Instructions Sorted by Form
E.1 Instructions Sorted by Form

Appendix F Instruction Set Legend
F.1 Instruction Set Legend

Appendix G User's Manual Revision History
G.1 Revision History

Glossary
Index
figures figure number title page number motorola contents xi 1-1 overview of powerpc architecture with altivec technology .................................... 1-4 1-2 altivec top-level diagram ........................................................................................ 1-7 1-3 big-endian byte ordering for a vector register ........................................................ 1-8 1-4 bit ordering ................................................................................................................ 1-8 1-5 intraelement example, vaddsbs .................................................................................. 1-9 1-6 interelement example, vperm ..................................................................................... 1-9 2-1 programming model?ll registers .......................................................................... 2-2 2-2 altivec register set .................................................................................................... 2-3 2-3 vector registers (vrs)................................................................................................ 2-4 2-4 vector status and control register (vscr) ............................................................... 2-5 2-5 32-bit vscr moved to a 128-bit vector register....................................................... 2-5 2-6 vector save/restore register (vrsave) ................................................................... 2-7 2-7 condition register (cr) ............................................................................................. 2-8 2-8 machine state register (msr) ................................................................................... 2-9 2-9 machine status save/restore register 0 (srr0) ..................................................... 2-11 2-10 machine status save/restore register 0 (srr1) ..................................................... 
2-11 3-1 big-endian mapping of a quad word ........................................................................ 3-3 3-2 little-endian mapping of a quad word ..................................................................... 3-4 3-3 little-endian mapping of quad word?lternate view ............................................ 3-4 3-4 quad word load with powerpc munged little-endian applied............................... 3-5 3-5 altivec little endian double-word swap.................................................................. 3-6 3-6 misaligned vector in big-endian mode...................................................................... 3-7 3-8 big-endian quad word alignment ............................................................................. 3-8 3-7 misaligned vector in little-endian addressing mode................................................ 3-8 3-9 little-endian alignment ........................................................................................... 3-11 4-1 register indirect with index addressing for loads/stores ....................................... 4-27 5-1 format of rb in dst instruction.................................................................................... 5-2 5-2 data stream touch ...................................................................................................... 5-3 5-3 srr1 bit settings after an altivec unavailable exception...................................... 5-11 6-1 format of rb in dst instruction (32-bit)..................................................................... 6-13 6-2 effects of example load/store instructions ............................................................. 6-15 6-3 load vector for shift left ......................................................................................... 6-18 6-4 instruction vperm used in aligning data ................................................................. 
6-19 6-5 vaddcuw?etermine carries of four unsigned integer adds (32-bit) .................. 6-30 6-6 vaddfp?dd four floating-point elements (32-bit) .............................................. 6-31 f r e e s c a l e s e m i c o n d u c t o r , i freescale semiconductor, inc. f o r m o r e i n f o r m a t i o n o n t h i s p r o d u c t , g o t o : w w w . f r e e s c a l e . c o m n c . . .
figures figure number title page number xii altivec programming environments manual motorola 6-7 vaddsbs?dd saturating sixteen signed integer elements (8-bit) ........................ 6-32 6-8 vaddshs?add saturating eight signed integer elements (16-bit) ......................... 6-33 6-9 vaddsws?dd saturating four signed integer elements (32-bit) .......................... 6-34 6-10 vaddubm?dd sixteen integer elements (8-bit) .................................................... 6-35 6-11 vaddubs?dd saturating sixteen unsigned integer elements (8-bit).................... 6-36 6-12 vadduhm?dd eight integer elements (16-bit) ..................................................... 6-37 6-13 vadduhs?dd saturating eight unsigned integer elements (16-bit) ..................... 6-38 6-14 vadduwm?dd four integer elements (32-bit)...................................................... 6-39 6-15 vadduws?dd saturating four unsigned integer elements (32-bit) ..................... 6-40 6-16 vand?ogical bitwise and .................................................................................... 6-41 6-17 vand?ogical bitwise and with complement ...................................................... 6-42 6-18 vavgsb?average sixteen signed integer elements (8-bit) .................................... 6-43 6-19 vavgsh?verage eight signed integer elements (16-bits)...................................... 6-44 6-20 vavgsw?average four signed integer elements (32-bit) ...................................... 6-45 6-21 vavgub?verage sixteen unsigned integer elements (8-bits)................................ 6-46 6-22 vavgsh?average eight signed integer elements (16-bit)...................................... 6-47 6-23 vavguw?verage four unsigned integer elements (32-bit) .................................. 
6-48 6-24 vcfsx?onvert four signed integer elements to four floating-point elements (32-bit) ................................................................................................................. 6-49 6-25 vcfux?onvert four unsigned integer elements to four floating-point elements (32-bit) ................................................................................................................. 6-50 6-26 vcmpbfp?ompare bounds of four floating-point elements (32-bit).................. 6-52 6-27 vcmpeqfp?ompare equal of four floating-point elements (32-bit) ................... 6-53 6-28 vcmpequb?ompare equal of sixteen integer elements (8-bits)........................... 6-54 6-29 vcmpequh?ompare equal of eight integer elements (16-bit) ............................. 6-55 6-30 vcmpequw?ompare equal of four integer elements (32-bit) ............................. 6-56 6-31 vcmpgefp?ompare greater-than-or-equal of four floating-point elements (32-bit) ................................................................................................................. 6-57 6-32 vcmpgtfp?ompare greater-than of four floating-point elements (32-bit)........ 6-58 6-33 vcmpgtsb?ompare greater-than of sixteen signed integer elements (8-bit)..... 6-59 6-34 vcmpgtsh?ompare greater-than of eight signed integer elements (16-bit)...... 6-60 6-35 vcmpgtsw?ompare greater-than of four signed integer elements (32-bit) ...... 6-61 6-36 vcmpgtub?ompare greater-than of sixteen unsigned integer elements (8-bit) 6-62 6-37 vcmpgtuh?ompare greater-than of eight unsigned integer elements (16-bit) . 6-63 6-38 vcmpgtuw?ompare greater-than of four unsigned integer elements (32-bit) . 6-64 6-39 vctsxs?onvert four floating-point elements to four signed integer elements (32-bit) ................................................................................................................. 
6-65 6-40 vctuxs?onvert four floating-point elements to four unsigned integer elements (32-bit) ................................................................................................................. 6-66 6-41 vexptefp? raised to the exponent estimate floating-point for four floating-point elements (32-bit) ................................................................................................. 6-68 f r e e s c a l e s e m i c o n d u c t o r , i freescale semiconductor, inc. f o r m o r e i n f o r m a t i o n o n t h i s p r o d u c t , g o t o : w w w . f r e e s c a l e . c o m n c . . .
figures figure number title page number motorola contents xiii 6-42 vexptefp?og2 estimate floating-point for four floating-point elements (32-bit) ................................................................................................................. 6-70 6-43 vmaddfp?ultiply-add four floating-point elements (32-bit)............................ 6-71 6-44 vmaxfp?aximum of four floating-point elements (32-bit) ............................... 6-72 6-45 vmaxsb?aximum of sixteen signed integer elements (8-bit) ............................ 6-73 6-46 vmaxsh?aximum of eight signed integer elements (16-bit) ............................. 6-74 6-47 vmaxsw?aximum of four signed integer elements (32-bit).............................. 6-75 6-48 vmaxub?aximum of sixteen unsigned integer elements (8-bit) ....................... 6-76 6-49 vmaxuh?aximum of eight unsigned integer elements (16-bit)......................... 6-77 6-50 vmaxuw?aximum of four unsigned integer elements (32-bit) ......................... 6-78 6-51 vmhaddshs?ultiply-high and add eight signed integer elements (16-bit) ....... 6-79 6-52 vmhraddshs?ultiply-high round and add eight signed integer elements (16-bit) ................................................................................................................. 6-80 6-53 vminfp?inimum of four floating-point elements (32-bit) ................................ 6-81 6-54 vminsb?inimum of sixteen signed integer elements (8-bit) ............................. 6-82 6-55 vminsh?inimum of eight signed integer elements (16-bit)............................... 6-83 6-56 vminsw?inimum of four signed integer elements (32-bit) ............................... 6-84 6-57 vminub?inimum of sixteen unsigned integer elements (8-bit)......................... 6-85 6-58 vminuh?inimum of eight unsigned integer elements (16-bit) .......................... 6-86 6-59 vminuw?inimum of four unsigned integer elements (32-bit) .......................... 
6-60 vmladduhm: Multiply-add of eight integer elements (16-bit)
6-61 vmrghb: Merge eight high-order elements (8-bit)
6-62 vmrghh: Merge four high-order elements (16-bit)
6-63 vmrghw: Merge four high-order elements (32-bit)
6-64 vmrglb: Merge eight low-order elements (8-bit)
6-65 vmrglh: Merge four low-order elements (16-bit)
6-66 vmrglw: Merge four low-order elements (32-bit)
6-67 vmsummbm: Multiply-sum of integer elements (8-bit to 32-bit)
6-68 vmsumshm: Multiply-sum of signed integer elements (16-bit to 32-bit)
6-69 vmsumshs: Multiply-sum of signed integer elements (16-bit to 32-bit)
6-70 vmsumubm: Multiply-sum of unsigned integer elements (8-bit to 32-bit)
6-71 vmsumuhm: Multiply-sum of unsigned integer elements (16-bit to 32-bit)
6-72 vmsumuhs: Multiply-sum of unsigned integer elements (16-bit to 32-bit)
6-73 vmulesb: Even multiply of eight signed integer elements (8-bit)
6-74 vmulesh: Even multiply of four signed integer elements (16-bit)
6-75 vmuleub: Even multiply of eight unsigned integer elements (8-bit)
6-76 vmuleuh: Even multiply of four unsigned integer elements (16-bit)
6-77 vmulosb: Odd multiply of eight signed integer elements (8-bit)
6-78 vmulosh: Odd multiply of four signed integer elements (16-bit)
6-79 vmuloub: Odd multiply of eight unsigned integer elements (8-bit)
6-80 vmulouh: Odd multiply of four unsigned integer elements (16-bit)
6-81 vnmsubfp: Negative multiply-subtract of four floating-point elements (32-bit)
6-82 vnor: Bitwise NOR of 128-bit vector
6-83 vor: Bitwise OR of 128-bit vector
6-84 vperm: Concatenate sixteen integer elements (8-bit)
6-85 How a word is packed to a half word
6-86 vpkpx: Pack eight elements (32-bit) to eight elements (16-bit)
6-87 vpkshss: Pack sixteen signed integer elements (16-bit) to sixteen signed integer elements (8-bit)
6-88 vpkshus: Pack sixteen signed integer elements (16-bit) to sixteen unsigned integer elements (8-bit)
6-89 vpkswss: Pack eight signed integer elements (32-bit) to eight signed integer elements (16-bit)
6-90 vpkswus: Pack eight signed integer elements (32-bit) to eight unsigned integer elements (16-bit)
6-91 vpkuhum: Pack sixteen unsigned integer elements (16-bit) to sixteen unsigned integer elements (8-bit)
6-92 vpkuhus: Pack sixteen unsigned integer elements (16-bit) to sixteen unsigned integer elements (8-bit)
6-93 vpkuwum: Pack eight unsigned integer elements (32-bit) to eight unsigned integer elements (16-bit)
6-94 vpkuwus: Pack eight unsigned integer elements (32-bit) to eight unsigned integer elements (16-bit)
6-95 vrefp: Reciprocal estimate of four floating-point elements (32-bit)
6-96 vrfim: Round to minus infinity of four floating-point integer elements (32-bit)
6-97 vrfin: Nearest round to nearest of four floating-point integer elements (32-bit)
6-98 vrfip: Round to plus infinity of four floating-point integer elements (32-bit)
6-99 vrfiz: Round-to-zero of four floating-point integer elements (32-bit)
6-100 vrlb: Left rotate of sixteen integer elements (8-bit)
6-101 vrlh: Left rotate of eight integer elements (16-bit)
6-102 vrlw: Left rotate of four integer elements (32-bit)
6-103 vrsqrtefp: Reciprocal square root estimate of four floating-point elements (32-bit)
6-104 vsel: Bitwise conditional select of vector contents (128-bit)
6-105 vsl: Shift bits left in vector (128-bit)
6-106 vslb: Shift bits left in sixteen integer elements (8-bit)
6-107 vsldoi: Shift left by bytes specified
6-108 vslh: Shift bits left in eight integer elements (16-bit)
6-109 vslo: Left byte shift of vector (128-bit)
6-110 vslw: Shift bits left in four integer elements (32-bit)
6-111 vspltb: Copy contents to sixteen elements (8-bit)
6-112 vsplth: Copy contents to eight elements (16-bit)
6-113 vspltisb: Copy value into sixteen signed integer elements (8-bit)
6-114 vspltish: Copy value to eight signed integer elements (16-bit)
6-115 vspltisw: Copy value to four signed elements (32-bit)
6-116 vspltw: Copy contents to four elements (32-bit)
6-117 vsr: Shift bits right for vectors (128-bit)
6-118 vsrab: Shift bits right in sixteen integer elements (8-bit)
6-119 vsrah: Shift bits right for eight integer elements (16-bit)
6-120 vsraw: Shift bits right in four integer elements (32-bit)
6-121 vsrb: Shift bits right in sixteen integer elements (8-bit)
6-122 vsrh: Shift bits right for eight integer elements (16-bit)
6-123 vsro: Vector shift right octet
6-124 vsrw: Shift bits right in four integer elements (32-bit)
6-125 vsubcuw: Subtract carryout of four unsigned integer elements (32-bit)
6-126 vsubfp: Subtract four floating-point elements (32-bit)
6-127 vsubsbs: Subtract sixteen signed integer elements (8-bit)
6-128 vsubshs: Subtract eight signed integer elements (16-bit)
6-129 vsubsws: Subtract four signed integer elements (32-bit)
6-130 vsububm: Subtract sixteen integer elements (8-bit)
6-131 vsububs: Subtract sixteen unsigned integer elements (8-bit)
6-132 vsubuhm: Subtract eight integer elements (16-bit)
6-133 vsubuhs: Subtract eight unsigned integer elements (16-bit)
6-134 vsubuwm: Subtract four integer elements (32-bit)
6-135 vsubuws: Subtract four unsigned integer elements (32-bit)
6-136 vsumsws: Sum four signed integer elements (32-bit)
6-137 vsum2sws: Two sums in the four signed integer elements (32-bit)
6-138 vsum4sbs: Four sums in the integer elements (32-bit)
6-139 vsum4shs: Four sums in the integer elements (32-bit)
6-140 vsum4ubs: Four sums in the integer elements (32-bit)
6-141 vupkhpx: Unpack high-order elements (16-bit) to elements (32-bit)
6-142 vupkhsb: Unpack high-order signed integer elements (8-bit) to signed integer elements (16-bit)
6-143 vupkhsh: Unpack signed integer elements (16-bit) to signed integer elements (32-bit)
6-144 vupklpx: Unpack low-order elements (16-bit) to elements (32-bit)
6-145 vupklsb: Unpack low-order elements (8-bit) to elements (16-bit)
6-146 vupklsh: Unpack low-order signed integer elements (16-bit) to signed integer elements (32-bit)
6-147 vxor: Bitwise XOR (128-bit)

Tables

Table number  Title
i     Acronyms and abbreviated terms
ii    Terminology conventions
iii   Instruction field conventions
2-1   VSCR field descriptions
2-2   VRSAVE bit settings
2-3   CR6 field bit settings for vector compare instructions
2-4   MSR bit settings
3-1   Memory operand alignment
3-2   Effective address modifications
4-1   Vector integer arithmetic instructions
4-2   CR6 field bit settings for vector integer compare instructions
4-3   Vector integer compare instructions
4-4   Vector integer logical instructions
4-5   Vector integer rotate instructions
4-6   Vector integer shift instructions
4-7   Floating-point arithmetic instructions
4-8   Floating-point multiply-add instructions
4-9   Floating-point rounding and conversion instructions
4-10  Common mathematical predicates
4-11  Other useful predicates
4-12  Floating-point compare instructions
4-13  Floating-point estimate instructions
4-14  Effective address alignment
4-15  Integer load instructions
4-16  Vector load instructions supporting alignment
4-17  Shift values for lvsl instruction
4-18  Shift values for lvsr instruction
4-19  Integer store instructions
4-20  Vector pack instructions
4-21  Vector unpack instructions
4-22  Vector merge instructions
4-23  Vector splat instructions
4-24  Vector permute instruction
4-25  Vector select instruction
4-26  Vector shift instructions
4-27  Coding various shifts and rotates with the vsldoi instruction
4-28  Move to/from condition register instructions
4-29  Simplified mnemonics for data stream touch (dst)
4-30  User-level cache instructions
5-1   AltiVec unavailable exception: register settings
5-2   Exception priorities (synchronous/precise exceptions)
6-1   Instruction syntax conventions
6-2   Notation and conventions
6-3   Instruction field conventions
6-4   Precedence rules
6-5   Special values of the element in vB
6-6   Special values of the element in vB
6-7   Special values of the element in vB
6-8   Special values of the element in vB
A-1   Instructions sorted by mnemonic in decimal format
B-1   Instructions sorted by mnemonic in binary format
C-1   Instructions sorted by opcode in decimal format
D-1   Instructions sorted by opcode in binary format
E-1   VA-form
E-2   VX-form
E-3   X-form
E-4   VXR-form
F-1   AltiVec instruction set legend

About This Book

The primary objective of this manual is to help programmers provide software that is compatible with processors that implement the PowerPC architecture and the AltiVec technology. This book describes how the AltiVec technology relates to the 32-bit portions of the PowerPC architecture.

To locate any published errata or updates for this document, refer to the web at http://www.motorola.com/semiconductors.

This book is one of two that discuss the AltiVec technology. The two books are as follows:

AltiVec Technology Programming Interface Manual (AltiVec PIM) is a reference guide for high-level programmers. The AltiVec PIM describes how programmers can access AltiVec functionality from programming languages such as C and C++. The AltiVec PIM defines a programming model for use with the AltiVec instruction set. Processors that implement the PowerPC architecture use the AltiVec instruction set as an extension of the PowerPC instruction set.

AltiVec Technology Programming Environments Manual (AltiVec PEM) is a reference guide for assembler programmers. The AltiVec PEM uses a standardized format to describe each instruction, showing syntax, instruction format, register transfer language (RTL) code that describes how the instruction works, and a listing of which registers, if any, are affected. At the bottom of each instruction entry is a figure that shows the operations on elements within source operands and where the results of those operations are placed in the destination operand.

Because it is important to distinguish between the levels of the PowerPC architecture to ensure compatibility across multiple platforms, those distinctions are shown clearly throughout this book.
This document stays consistent with the PowerPC architecture in referring to three levels, or programming environments, which are as follows:

PowerPC user instruction set architecture (UISA): The UISA defines the level of the architecture to which user-level software should conform. The UISA defines the base user-level instruction set, user-level registers, data types, memory conventions, and the memory and programming models seen by application programmers.

PowerPC virtual environment architecture (VEA): The VEA, which is the smallest component of the PowerPC architecture, defines additional user-level functionality that falls outside typical user-level software requirements. The VEA describes the memory model for an environment in which multiple processors or other devices can access external memory and defines aspects of the cache model and cache control instructions from a user-level perspective. VEA resources are particularly useful for optimizing memory accesses and for managing resources in an environment in which other processors and other devices can access external memory. Implementations that conform to the VEA also conform to the UISA but may not necessarily adhere to the OEA.

PowerPC operating environment architecture (OEA): The OEA defines supervisor-level resources typically required by an operating system. It defines the memory management model, supervisor-level registers, and the exception model. Implementations that conform to the OEA also conform to the UISA and VEA.

Most of the discussions on the AltiVec technology are at the UISA level. The level of the architecture to which text refers is indicated in the outer margin, using the conventions shown in the section "Conventions."

For ease in reference, this book and the processor user's manuals have arranged the architecture information into topics that build upon one another, beginning with a description and complete summary of registers and instructions (for all three environments) and progressing to more specialized topics such as the cache, exception, and memory management models. As such, chapters may include information from multiple levels of the architecture, but when discussing OEA and VEA, the level is noted in the text.

It is beyond the scope of this manual to describe individual AltiVec technology implementations on processors that implement the PowerPC architecture. Each processor that implements the PowerPC architecture and AltiVec technology is unique in its implementation.

The information in this book is subject to change without notice, as described in the disclaimers on the title page of this book.
As with any technical documentation, it is the reader's responsibility to be sure they are using the most recent version of the documentation. For more information, contact your sales representative or visit our web site at http://www.mot.com/semiconductors.

Audience

This manual is intended for system software and hardware developers and application programmers who want to develop products using the AltiVec technology extension to the PowerPC architecture. It is assumed that the reader understands operating systems, microprocessor system design, the basic principles of RISC processing, and details of the PowerPC architecture.

This book describes how the AltiVec technology interacts with the 32-bit portions of the PowerPC architecture.

Organization

Following is a summary and a brief description of the major sections of this manual:

Chapter 1, "Overview," is useful for those who want a general understanding of the features and functions of the AltiVec technology. This chapter provides an overview of how the AltiVec technology defines the register set, operand conventions, addressing modes, instruction set, cache model, and exception model.

Chapter 2, "AltiVec Register Set," is useful for software engineers who need to understand the PowerPC programming model for the three programming environments. The chapter also discusses the functionality of the AltiVec technology registers and how they interact with the other PowerPC registers.

Chapter 3, "Operand Conventions," describes how the AltiVec technology interacts with the PowerPC conventions for storing data in memory, including information regarding alignment, single-precision floating-point conventions, and big- and little-endian byte ordering.

Chapter 4, "Addressing Modes and Instruction Set Summary," provides an overview of the AltiVec technology addressing modes and a brief description of the AltiVec technology instructions organized by function.

Chapter 5, "Cache, Exceptions, and Memory Management," provides a discussion of the cache and memory model defined by the VEA and aspects of the cache model that are defined by the OEA. It also describes the exception model defined in the UISA.

Chapter 6, "AltiVec Instructions," functions as a handbook for the AltiVec instruction set. Instructions are sorted by mnemonic. Each instruction description includes the instruction formats and, where they help in understanding what the instruction does, figures.

Appendices A, B, C, D, E, and F list all of the AltiVec instructions, grouped according to mnemonic, opcode, and form, in both decimal and binary order.

Appendix G, "User's Manual Revision History," describes changes since the previous revision of this document.
This manual also includes a glossary and an index.

Suggested Reading

This section lists additional reading that provides background for the information in this manual as well as general information about the AltiVec technology and PowerPC architecture.

General Information

The following documentation, available through Morgan-Kaufmann Publishers, 340 Pine Street, Sixth Floor, San Francisco, CA, provides useful information about the PowerPC architecture and computer architecture in general:

The PowerPC Architecture: A Specification for a New Family of RISC Processors, Second Edition, by International Business Machines, Inc. For updates to the specification, see http://www.austin.ibm.com/tech/ppc-chg.html.

PowerPC Microprocessor Common Hardware Reference Platform: A System Architecture, by Apple Computer, Inc., International Business Machines, Inc., and Motorola, Inc.

Computer Architecture: A Quantitative Approach, Second Edition, by John L. Hennessy and David A. Patterson

Computer Organization and Design: The Hardware/Software Interface, Second Edition, by David A. Patterson and John L. Hennessy

Related Documentation

Motorola documentation is available from the sources listed on the back cover of this manual; the document order numbers are included in parentheses for ease in ordering:

Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture (Programming Environments Manual): Describes resources defined by the PowerPC architecture (documentation order number: MPCFP32B/AD).

User's manuals: These books provide details about individual implementations and are intended for use with the Programming Environments Manual.

Addenda/errata to user's manuals: Because some processors have follow-on parts, an addendum is provided that describes the additional features and functionality changes. These addenda are intended for use with the corresponding user's manuals.

Hardware specifications: Hardware specifications provide specific data regarding bus timing, signal behavior, and AC, DC, and thermal characteristics, as well as other design considerations.

Technical summaries: Each device has a technical summary that provides an overview of its features. This document is roughly the equivalent of the overview (Chapter 1) of an implementation's user's manual.

Application notes: These short documents address specific design issues useful to programmers and engineers working with Motorola processors.

Additional literature is published as new processors become available. For a current list of documentation, refer to http://www.motorola.com/semiconductors.

Conventions

This document uses the following notational conventions:

cleared/set      When a bit takes the value zero, it is said to be cleared; when it takes a value of one, it is said to be set.
mnemonics        Instruction mnemonics are shown in lowercase bold.
italics          Italics indicate variable command parameters, for example, bcctrx. Book titles in text are set in italics.
0x0              Prefix to denote a hexadecimal number
0b0              Prefix to denote a binary number
rA, rB           Instruction syntax used to identify a source general-purpose register (GPR)
rD               Instruction syntax used to identify a destination GPR
frA, frB, frC    Instruction syntax used to identify a source floating-point register (FPR)
frD              Instruction syntax used to identify a destination FPR
REG[FIELD]       Abbreviations for registers are shown in uppercase text. Specific bits, fields, or ranges appear in brackets. For example, MSR[LE] refers to the little-endian mode enable bit in the machine state register.
vA, vB, vC       Instruction syntax used to identify a source vector register (VR)
vD               Instruction syntax used to identify a destination VR
x                In some contexts, such as signal encodings, an unitalicized x indicates a don't care.
x                An italicized x indicates an alphanumeric variable.
n                An italicized n indicates a numeric variable.
¬                NOT logical operator
&                AND logical operator
|                OR logical operator
U                This symbol identifies text that is relevant with respect to the PowerPC user instruction set architecture (UISA). This symbol is used both for information that can be found in the UISA specification as well as for explanatory information related to that programming environment.
V                This symbol identifies text that is relevant with respect to the PowerPC virtual environment architecture (VEA). This symbol is used both for information that can be found in the VEA specification as well as for explanatory information related to that programming environment.
O                This symbol identifies text that is relevant with respect to the PowerPC operating environment architecture (OEA). This symbol is used both for information that can be found in the OEA specification as well as for explanatory information related to that programming environment.
                 Indicates functionality defined by the AltiVec technology.
0 0 0 0          Indicates reserved bits or bit fields in a register. Although these bits may be written to as ones or zeros, they are always read as zeros.

Additional conventions used with instruction encodings are described in Section 6.1, "Instruction Formats."

Acronyms and Abbreviations

Table i contains acronyms and abbreviations that are used in this document. Note that the meanings for some acronyms (such as SDR1 and XER) are historical, and the words for which an acronym stands may not be intuitively obvious.

Table i. Acronyms and Abbreviated Terms

Term         Meaning
AltiVec PEM  AltiVec Technology Programming Environments Manual
AltiVec PIM  AltiVec Technology Programming Interface Manual
ALU          Arithmetic logic unit
BAT          Block address translation
CR           Condition register
CTR          Count register
DABR         Data address breakpoint register
DAR          Data address register
DBAT         Data BAT
DEC          Decrementer register
DSISR        Register used for determining the source of a DSI exception
EA           Effective address
ECC          Error checking and correction
FPR           Floating-point register
FPSCR         Floating-point status and control register
FPU           Floating-point unit
GPR           General-purpose register
IABR          Instruction address breakpoint register
IBAT          Instruction BAT
IEEE          Institute of Electrical and Electronics Engineers
ITLB          Instruction translation lookaside buffer
IU            Integer unit
L2            Secondary cache
L3            Level 3 cache
LIFO          Last-in-first-out
LR            Link register
LRU           Least recently used
LSB           Least-significant byte
lsb           Least-significant bit
LSU           Load/store unit
LSQ           Least-significant quad-word
MESI          Modified/exclusive/shared/invalid (cache coherency protocol)
MMCRn         Monitor mode control registers
MMU           Memory management unit
MSB           Most-significant byte
msb           Most-significant bit
MSQ           Most-significant quad-word
MSR           Machine state register
NaN           Not a number
NIA           Next instruction address
No-op         No operation
OEA           Operating environment architecture
PEM           Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture
PMCn          Performance monitor counter registers
PTE           Page table entry
PTEG          Page table entry group
PVR           Processor version register
RISC          Reduced instruction set computing
RTL           Register transfer language
RWITM         Read with intent to modify
RWNITM        Read with no intent to modify
SDA           Sampled data address register
SDR1          Register that specifies the page table base address for virtual-to-physical address translation
SIA           Sampled instruction address register
SIMM          Signed immediate value
SPR           Special-purpose register
SRn           Segment register
SRR0          Machine status save/restore register 0
SRR1          Machine status save/restore register 1
STE           Segment table entry
TB            Time base facility
TBL           Time base lower register
TBU           Time base upper register
TLB           Translation lookaside buffer
UIMM          Unsigned immediate value
UISA          User instruction set architecture
UMMCRn        User monitor mode control registers
UPMCn         User performance monitor counter registers
VA            Virtual address
VEA           Virtual environment architecture
VPU           Vector permute unit
VR            Vector register
VSCR          Vector status and control register
VTQ           Vector touch queue
XER           Register used for indicating conditions such as carries and overflows for integer operations
Terminology Conventions

Table ii lists certain terms used in this manual that differ from the architecture terminology conventions. Table iii describes instruction field notation conventions used in this manual.

Table ii. Terminology Conventions

The Architecture Specification            This Manual
Data storage interrupt (DSI)              DSI exception
Extended mnemonics                        Simplified mnemonics
Fixed-point unit (FXU)                    Integer unit (IU)
Instruction storage interrupt (ISI)       ISI exception
Interrupt                                 Exception
Privileged mode (or privileged state)     Supervisor-level privilege
Problem mode (or problem state)           User-level privilege
Real address                              Physical address
Relocation                                Translation
Storage (locations)                       Memory
Storage (the act of)                      Access
Store in                                  Write back
Store through                             Write through

Table iii. Instruction Field Conventions

The Architecture Specification            Equivalent to:
BA, BB, BT                                crbA, crbB, crbD (respectively)
BF, BFA                                   crfD, crfS (respectively)
D                                         d
DS                                        ds
FLM                                       FM
FRA, FRB, FRC, FRT, FRS                   frA, frB, frC, frD, frS (respectively)
FXM                                       CRM
RA, RB, RT, RS                            rA, rB, rD, rS (respectively)
SI                                        SIMM
U                                         IMM
UI                                        UIMM
VA, VB, VT, VS                            vA, vB, vD, vS (respectively)
VEC                                       AltiVec technology
/, //, ///                                0...0 (shaded)
Chapter 1
Overview

This chapter provides an overview of AltiVec technology, including general concepts that help in understanding the features that AltiVec technology provides. There is also information on how AltiVec technology works with PowerPC architecture.

1.1 Overview

AltiVec technology provides a software model that accelerates the performance of various software applications as they run on reduced instruction set computing (RISC) microprocessors. AltiVec technology extends the instruction set architecture (ISA) of PowerPC architecture. AltiVec ISA is based on separate vector/SIMD-style (single instruction stream, multiple data streams) execution units that have high data parallelism. That is, AltiVec technology operates on multiple data items in a single instruction, which allows for a highly efficient way to process large quantities of information.

High degrees of parallelism are achievable with simple in-order instruction dispatch and low-instruction time processing. However, the ISA is designed so as not to impede additional parallelism through dispatch to multiple execution units or multithreaded execution unit pipelines.

AltiVec technology is an architecture that defines a set of registers and execution units that can be used in conjunction with the PowerPC architecture. All instructions are designed to be easily pipelined, with pipeline latencies no greater than that of the scalar, double-precision, floating-point multiply-add. There are no operating mode switches, which makes it possible to interleave AltiVec instructions with the existing floating-point and integer instructions. The vector unit minimizes exceptions and has few shared resources, so it does not have to be tightly synchronized with the other execution units, which avoids delays in executing instructions.

AltiVec technology's SIMD-style extension provides an approach to accelerating the processing of data streams.
That is, in SIMD parallel processing, the vector unit fetches and interprets instructions and processes multiple pieces of data simultaneously. By processing whole streams of data at once, it provides a fast and efficient way to manipulate large quantities of information. AltiVec instructions provide a significant speedup for communications, multimedia, and other performance-driven applications by exploiting data-level parallelism and keeping the processing of data within the vector register file. Because the register files are separate, data accesses by the execution units to different register files can be
done concurrently. The data stream engine in AltiVec supports data-intensive prefetching, minimizing latency in memory access bottlenecks.

By using the SIMD parallelism in AltiVec technology, performance can be accelerated on processors that implement the PowerPC architecture to a level that allows real-time processing of one or more data streams at the same time. A majority of audio and visual applications require no more than 8- or 16-bit data types to represent satisfactory color and sound. AltiVec ISA can help accelerate the processing of the following types of applications:

• Voice over IP (VoIP). VoIP transmits voice as compressed digital data packets over the Internet.
• Access concentrators/DSLAMs. An access concentrator strips data traffic off POTS lines and inserts it onto the Internet. A digital subscriber loop access multiplexer (DSLAM) pulls data off at a switch and immediately routes it to the Internet. This allows it to concentrate ADSL digital traffic at the switch and off-load the network.
• Speech recognition. Speech processing allows voice recognition for use in applications such as directory assistance and automatic dialing.
• Voice/sound processing (audio encode and decode). Voice processing uses signal processing to improve sound quality on lines.
• Communications:
  - Multi-channel modems
  - Modem banks can use AltiVec technology to replace signal processors in DSP farms.
• 2D and 3D graphics: arcade-type games
• Image and video processing: JPEG, filters
• Echo cancellation. Echo cancellation is used to eliminate echo on long delay calls (250-500 milliseconds, as in satellite communications).
• Array number processing
• Basestation processing. A cellular basestation compresses digital voice data for transmission within the Internet.
• Video conferencing: H.261, H.263

In this document, the term "implementation" refers to a hardware device (typically a microprocessor) that complies with PowerPC architecture.
AltiVec technology can be used as an extension to various RISC microprocessors; however, in this book it is discussed within the context of PowerPC architecture, described as follows:
• Programming model
  - Instruction set. The AltiVec instruction set specifies instructions that extend the PowerPC instruction set. These instructions are organized similarly to PowerPC instructions (vector integer, vector floating-point, vector load/store, and vector permutation and formatting instructions). The specific instructions, and the forms used for encoding them, are provided in Appendix A, "Instruction Set."
  - Register set. The AltiVec programming model defines new AltiVec registers, additions to the PowerPC register set, and how existing PowerPC registers are affected by the AltiVec technology. The model also addresses memory conventions, including details regarding the byte ordering for quad words.
• Memory model. AltiVec technology specifies additional cache management instructions. That is, AltiVec instructions can control software-directed data prefetching.
• Exception model. AltiVec technology provides very few exceptions, so processing is efficient. Among the few exceptions are an AltiVec unavailable (VUI) exception and a DSI exception.
• Memory management model. The memory model for AltiVec technology is the same as for PowerPC architecture. AltiVec memory accesses are always assumed to be aligned. If an operand is misaligned, additional AltiVec instructions can be used to ensure that the operand is placed correctly in the vector register.
• Time-keeping model. The PowerPC time-keeping model is not affected by AltiVec technology.

To locate published errata or updates for this document, refer to the website at http://www.motorola.com/semiconductors.

1.2 AltiVec Technology Overview

AltiVec technology expands PowerPC architecture through the addition of a 128-bit vector execution unit, which operates concurrently with the existing integer and floating-point units. The dispatch unit can issue more than one instruction at a time, so there is no penalty for mingling different types of instructions.
A vector execution unit can provide both a vector permute unit (vperm) and a vector arithmetic logical unit (VALU). With a separate permute unit, data reorganization instructions can proceed concurrently with arithmetic instructions.

AltiVec technology can be thought of as a set of registers and execution units that can be added to PowerPC architecture in a manner analogous to the addition of floating-point units. Floating-point units were added to provide support for high-precision scientific calculations, and AltiVec technology is added to PowerPC architecture to accelerate the next level of performance-driven, high-bandwidth communications and computing
applications. Figure 1-1 provides a high-level overview of the PowerPC architecture with the AltiVec technology.

[Figure 1-1. Overview of PowerPC Architecture with AltiVec Technology. Labels recovered from the figure: instruction stream; dispatch unit; integer unit with GPRs (32 bits); floating-point unit with FPRs (64 bits); vector unit/s with VRs (128 bits); cache/memory.]

AltiVec technology is purposefully simple so that there are minimal exceptions, no hardware misaligned access support, and no complex functions. AltiVec technology is scaled down to the necessary pieces only, in order to facilitate efficient cycle time, latency, and throughput on hardware implementations. AltiVec technology defines the following:

• Fixed 128-bit-wide vector length that can be subdivided into sixteen 8-bit bytes, eight 16-bit half words, or four 32-bit words
• Vector register file (VRF) architecturally separate from floating-point registers (FPRs) and general-purpose registers (GPRs)
• Vector integer and floating-point arithmetic
• Four operands for most instructions (three source operands and one result)
• Saturation clamping. Unsigned results are clamped to zero on underflow and to the maximum positive integer value (2^n - 1, for example, 255 for byte fields) on overflow. Signed results are clamped to the smallest representable negative number (-2^(n-1), for example, -128 for byte fields) on underflow, and to the largest representable positive number (2^(n-1) - 1, for example, +127 for byte fields) on overflow.
• Operations selected based on utility to digital signal processing algorithms (including 3D)
• AltiVec instructions provide a vector compare and select mechanism to implement conditional execution as the preferred way to control data flow in AltiVec programs.
• Instructions that enhance the cache/memory interface

1.2.1 Levels of AltiVec ISA

AltiVec ISA follows the layering of PowerPC architecture. PowerPC architecture has three levels, defined as follows:

• User instruction set architecture (UISA): The UISA defines the level of the architecture to which user-level (referred to as problem state in the architecture specification) software should conform. The UISA defines the base user-level instruction set, user-level registers, data types, floating-point memory conventions, and exception model as seen by user programs, and the memory and programming models. The icon shown in the margin identifies text that is relevant to the UISA.
• Virtual environment architecture (VEA): The VEA defines additional user-level functionality that falls outside typical user-level software requirements. The VEA describes the memory model for an environment in which multiple devices can access memory, defines aspects of the cache model, defines cache control instructions, and defines the time base facility from a user-level perspective. The icon shown in the margin identifies text that is relevant to the VEA. Implementations that conform to the VEA also adhere to the UISA, but may not necessarily adhere to the OEA.
• Operating environment architecture (OEA): The OEA defines supervisor-level (referred to as privileged state in the architecture specification) resources typically required by an operating system. The OEA defines the memory management model, supervisor-level registers, synchronization requirements, and the exception model. The OEA also defines the time base feature from a supervisor-level perspective.
The icon shown in the margin identifies text that is relevant to the OEA. Implementations that conform to the OEA also conform to the UISA and VEA.

AltiVec technology defines instructions at the UISA and VEA levels. There are no AltiVec instructions defined at the OEA level. The distinctions between the levels are noted in the text throughout the document.

This book describes the 32-bit PowerPC architecture mode, and instructions are described from a 32-bit perspective.
1.2.2 Features Not Defined by AltiVec ISA

Because flexibility is an important design goal of AltiVec technology, there are many aspects of the microprocessor design, typically relating to the hardware implementation, that AltiVec ISA does not define. For example, the number and the nature of execution units are not defined. AltiVec ISA is a vector/SIMD architecture, and as such makes it easier to implement instruction pipelining and parallel execution units to maximize instruction throughput. However, AltiVec ISA does not define the internal hardware details of implementations. For example, one processor may use a simple implementation having two vector execution units, whereas another may provide a bigger, faster microprocessor design with several concurrently pipelined vector arithmetic logical units (ALUs) with separate load/store units (LSUs) and prefetch units.

1.3 AltiVec Architectural Model

This section provides overviews of aspects defined by AltiVec ISA, following the same order as the rest of this book. The topics are as follows:

• Registers and programming model
• Operand conventions
• Addressing modes and instruction set
• Cache, exceptions, and memory management models

1.3.1 AltiVec Registers and Programming Model

In AltiVec technology, the ALU operates on from one to three source vectors and produces a single destination vector on each instruction. The ALU is a SIMD-style arithmetic unit that performs the same operation on all the data elements that make up each vector. This scheme allows efficient code scheduling in a highly parallel processor. Load and store instructions are the only instructions that transfer data between registers and memory. The vector unit and vector register file are shown in Figure 1-2.
[Figure 1-2. AltiVec Top-Level Diagram. Labels recovered from the figure: vector register file (VRF) holding VR0, VR1, VR2, ..., VR30, VR31; 128-bit data paths between the VRF and the vector unit; result/destination vector register.]

The vector unit is a SIMD-style unit in which an instruction performs the same operation in parallel on the data elements that make up each vector. Architecturally, the vector register file (VRF) is separate from the GPRs and FPRs. The AltiVec programming model incorporates the 32 registers of the VRF; each register is 128 bits wide.

1.3.2 Operand Conventions

Operand conventions define how data is stored in vector registers and memory.

1.3.2.1 Byte Ordering

The default mapping for AltiVec ISA is PowerPC big-endian, but AltiVec ISA provides the option of operating in either big- or little-endian mode. The endian support of PowerPC architecture does not address any data element larger than a double word; the basic memory unit for vectors is a quad word.
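As a rough illustration of the big-endian default, the following Python sketch models a quad word as 16 bytes and views the same 128 bits as byte, half-word, or word elements, with element 0 at the high-order end. The function names `load_quad_word` and `elements` are hypothetical helpers written for this example; they model only the element numbering for an aligned access in big-endian mode and are not part of any AltiVec programming interface.

```python
def load_quad_word(memory: bytes, ea: int) -> list[int]:
    """Model a big-endian vector load from a quad-word-aligned address:
    the byte at ea + i becomes byte element i of the register."""
    if ea % 16 != 0:
        raise ValueError("this sketch only models quad-word-aligned addresses")
    return list(memory[ea:ea + 16])


def elements(reg_bytes: list[int], size: int) -> list[int]:
    """View the 16 register bytes as big-endian elements of `size` bytes
    (1 = byte, 2 = half word, 4 = word). Element 0 is the high-order one."""
    return [int.from_bytes(bytes(reg_bytes[i:i + size]), "big")
            for i in range(0, 16, size)]


# Illustrative memory image: address k holds the byte value k.
memory = bytes(range(0x20))
vr = load_quad_word(memory, 16)   # byte elements 0..15 hold 0x10..0x1F
words = elements(vr, 4)           # four 32-bit word elements
halfwords = elements(vr, 2)       # eight 16-bit half-word elements
```

In this model, word element 0 is built from byte elements 0 through 3, so the byte at the lowest address ends up in the most significant position of the register.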
Big-endian byte ordering is shown in Figure 1-3. As shown in Figure 1-3, the elements in vector registers are numbered using big-endian byte ordering. For example, the high-order (or most significant) byte element is numbered 0 and the low-order (or least significant) byte element is numbered 15.

When defining high order and low order for elements in a vector register, take care not to confuse these terms with the bit numbering. That is, in Figure 1-4, the high-order half word for word 0 (bits 0-31) would be half word 0 (bits 0-15), and the low-order half word for word 0 would be half word 1 (bits 16-31).

In big-endian mode, an AltiVec quad word load instruction for which the effective address (EA) is quad-word aligned places the byte addressed by EA into byte element 0 of the target vector register. The byte addressed by EA + 1 is placed in byte element 1, and so forth. Similarly, an AltiVec quad word store instruction for which the EA is quad-word aligned places byte element 0 of the source vector register into the byte addressed by EA. Byte element 1 is placed into the byte addressed by EA + 1, and so forth.

1.3.2.2 Floating-Point Conventions

AltiVec ISA has two floating-point modes: a Java-/IEEE-/C9X-compliant mode and a possibly faster non-Java/non-IEEE mode. AltiVec ISA conforms to the Java Language Specification (hereafter referred to as Java), which is a subset of the default environment specified by the IEEE standard (ANSI/IEEE Standard 754-1985, IEEE Standard for Binary Floating-Point Arithmetic). For aspects of floating-point behavior that are not defined by Java but are defined by the IEEE standard, AltiVec ISA conforms to the IEEE standard.
Figure 1-3. Big-Endian Byte Ordering for a Vector Register

Quad word (bits 0-127; bit 0 is the MSB, high order, and bit 127 is the LSB, low order)

Word    Half words    Bytes       Bits
0       0, 1          0-3         0-31
1       2, 3          4-7         32-63
2       4, 5          8-11        64-95
3       6, 7          12-15       96-127

Figure 1-4. Bit Ordering

Word 0: high-order half word (bits 0-15), low-order half word (bits 16-31)

For aspects of floating-point behavior that are defined neither by Java nor by the IEEE standard but are defined by the C9X Floating-Point
Proposal WG14/N546 X3J11/96-010 (Draft 2/26/96) (hereafter referred to as C9X), AltiVec ISA conforms to C9X when in Java-compliant mode.

1.3.3 AltiVec Addressing Modes

As with PowerPC instructions, AltiVec instructions are encoded as single-word (32-bit) instructions. Instruction formats are consistent among all instruction types, permitting decoding to proceed in parallel with operand accesses. This fixed instruction length and consistent format simplifies instruction pipelining. AltiVec load, store, and stream prefetch instructions use secondary opcodes in primary opcode 31 (0b011111). AltiVec ALU-type instructions use primary opcode 4 (0b000100).

AltiVec ISA supports both intraelement and interelement operations. In an intraelement operation, each element is combined with the corresponding elements of the other source operand registers, and the result is placed in the corresponding field of the destination operand register. An example of an intraelement operation is the vector add signed byte saturate (vaddsbs) instruction shown in Figure 1-5.

[Figure 1-5. Intraelement Example, vaddsbs: elements 0 through 15 of vA and vB are added pairwise to produce elements 0 through 15 of vD.]

In this example, the sixteen elements (8 bits per element) in register vA are added to the corresponding sixteen elements (8 bits per element) in register vB, and the sixteen results are placed in the corresponding elements in register vD.

In interelement operations, data paths cross over. That is, different elements from each source operand are used in the resulting destination operand. An example of an interelement operation is the vector permute (vperm) instruction shown in Figure 1-6.

[Figure 1-6. Interelement Example, vperm: the bytes of vA are labeled 00 through 0F and the bytes of vB are labeled 10 through 1F; the control vector vC = 01 14 18 10 16 15 19 1A 1C 1C 1C 13 08 1D 1B 0E selects the byte copied into each of elements 0 through 15 of vD.]
In this example, vperm allows any byte in the two source vector registers (vA and vB) to be copied to any byte in the destination vector register, vD. The bytes in a third source vector register (vC) specify from which byte in the first two source vector registers the corresponding target byte is to be copied. Thus, in an interelement operation, an element of the destination register need not come from the corresponding element position of a source register.

Most arithmetic and logical instructions are intraelement operations. The crossover data paths have been restricted as much as possible to the interelement manipulation instructions (unpack, pack, permute, etc.) with the idea of implementing the ALU and shift/permute as separate execution units. The following list of instructions distinguishes between interelement and intraelement instructions:

• Vector intraelement instructions
  - Vector integer instructions
    - Vector integer arithmetic instructions
    - Vector integer compare instructions
    - Vector integer rotate and shift instructions
  - Vector floating-point instructions
    - Vector floating-point arithmetic instructions
    - Vector floating-point rounding and conversion instructions
    - Vector floating-point compare instruction
    - Vector floating-point estimate instructions
  - Vector memory access instructions
• Vector interelement instructions
  - Vector alignment support instructions
  - Vector permutation and formatting instructions
    - Vector pack instructions
    - Vector unpack instructions
    - Vector merge instructions
    - Vector splat instructions
    - Vector permute instructions
    - Vector shift left/right instructions
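The difference between the two kinds of operation can be sketched in a few lines of Python. This is a simplified model, not an implementation of the hardware: the functions below are named after the vaddsbs and vperm instructions, but they are ordinary Python written for illustration, and the sample register contents are assumptions. The control bytes are read from Figure 1-6 as best they can be recovered, and the model assumes that the low-order 5 bits of each control byte index the 32 bytes formed by vA followed by vB.

```python
def saturate(value: int, lo: int, hi: int) -> int:
    """Clamp value to the inclusive range [lo, hi]."""
    return max(lo, min(hi, value))


def vaddsbs(va: list[int], vb: list[int]) -> list[int]:
    """Intraelement model: result element i depends only on element i of
    each source. Signed bytes saturate to the range -128..+127."""
    return [saturate(a + b, -128, 127) for a, b in zip(va, vb)]


def vperm(va: list[int], vb: list[int], vc: list[int]) -> list[int]:
    """Interelement model: each control byte in vc picks any one of the
    32 source bytes (va followed by vb) for the matching result byte."""
    table = list(va) + list(vb)
    return [table[c & 0x1F] for c in vc]


# Intraelement: lanes never interact, and overflow clamps instead of wrapping.
sums = vaddsbs([100, -100, 5] + [0] * 13, [100, -100, 6] + [0] * 13)

# Interelement: illustrative source bytes 0xA0..0xAF and 0xB0..0xBF.
va = list(range(0xA0, 0xB0))
vb = list(range(0xB0, 0xC0))
vc = [0x01, 0x14, 0x18, 0x10, 0x16, 0x15, 0x19, 0x1A,
      0x1C, 0x1C, 0x1C, 0x13, 0x08, 0x1D, 0x1B, 0x0E]
vd = vperm(va, vb, vc)
```

Control values 0x00 through 0x0F select from the first source and 0x10 through 0x1F from the second, so a single control vector can both reorder and merge bytes. The same control value may appear more than once (0x1C here), which copies one source byte into several result positions.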
1.3.4 AltiVec Instruction Set

Although these categories are not defined by AltiVec ISA, AltiVec instructions can be grouped as follows:

• Vector integer arithmetic instructions: These instructions are defined by the UISA. They include computational, logical, rotate, and shift instructions.
  - Vector integer arithmetic instructions
  - Vector integer compare instructions
  - Vector integer logical instructions
  - Vector integer rotate and shift instructions
• Vector floating-point arithmetic instructions: These include floating-point arithmetic instructions defined by the UISA.
  - Vector floating-point arithmetic instructions
  - Vector floating-point multiply/add instructions
  - Vector floating-point rounding and conversion instructions
  - Vector floating-point compare instruction
  - Vector floating-point estimate instructions
• Vector load and store instructions: These include load and store instructions for vector registers defined by the UISA.
• Vector permutation and formatting instructions: These instructions are defined by the UISA.
  - Vector pack instructions
  - Vector unpack instructions
  - Vector merge instructions
  - Vector splat instructions
  - Vector permute instructions
  - Vector select instructions
  - Vector shift instructions
• Processor control instructions: These instructions are used to read from and write to the vector status and control register (VSCR). These instructions are defined by the UISA.
• Memory control instructions: These instructions are used for managing caches (user level and supervisor level). The instructions are defined by the VEA and include data stream instructions.
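The compare and select instructions in this list can be combined to express a condition without branching. The Python sketch below models that idea for four signed word elements. It assumes that a compare produces an all-ones or all-zeros mask in each element and that a select chooses bit by bit between two sources under control of that mask. The functions are named after the vcmpgtsw and vsel mnemonics, which are not named in this section, and `vector_max` is a hypothetical helper for this example.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x: int) -> int:
    """Reinterpret a 32-bit pattern as a signed two's-complement value."""
    return x - (1 << 32) if x & 0x80000000 else x


def vcmpgtsw(va: list[int], vb: list[int]) -> list[int]:
    """Element-wise signed greater-than on words: all ones where true,
    all zeros where false."""
    return [MASK32 if a > b else 0 for a, b in zip(va, vb)]


def vsel(va: list[int], vb: list[int], vc: list[int]) -> list[int]:
    """Bitwise select: where a mask bit in vc is 0 take the bit from va,
    where it is 1 take the bit from vb. Returns 32-bit patterns."""
    return [((a & ~c) | (b & c)) & MASK32 for a, b, c in zip(va, vb, vc)]


def vector_max(va: list[int], vb: list[int]) -> list[int]:
    """Branch-free element-wise maximum built from compare and select."""
    mask = vcmpgtsw(va, vb)            # all ones where va > vb
    raw = vsel(vb, va, mask)           # take va where the mask is set
    return [to_signed32(x) for x in raw]


result = vector_max([1, -5, 7, 0], [2, -9, 7, -1])
```

Every element is processed the same way regardless of the comparison outcome, which is why this pattern suits a SIMD unit better than a per-element branch.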
1.3.5 AltiVec Cache Model

AltiVec ISA defines several instructions for enhancements to cache management. These instructions allow software to indicate to the cache hardware how it should prefetch and prioritize writeback of data. The AltiVec ISA does not define hardware aspects of cache implementations.

1.3.6 AltiVec Exception Model

AltiVec vector instructions generate very few exceptions. Data stream instructions never cause an exception themselves. Vector load and store instructions that attempt to access a direct-store segment cause a DSI exception. The AltiVec unit does not report IEEE exceptions; there are no status flags and the unit has no architecturally visible traps. Default results are produced for all exception conditions as specified first by the Java specification. If no default exists, the IEEE standard's default is used. Then, if no default exists, the C9X default is used.

Exceptions have been minimized so that the vector unit does not have to be tightly synchronized with the existing floating-point and integer units. Simplifying the communications path with other units allows fine-grain interleaving of instructions, which increases instruction throughput.

1.3.7 Memory Management Model

In a processor that implements the PowerPC architecture, the MMU's primary functions are to translate logical (effective) addresses to physical addresses for memory accesses and I/O accesses (most I/O accesses are assumed to be memory-mapped) and to provide access protection on a block or page basis. Some protection is also available even if translation is disabled. Typically, it is not programmable.

The AltiVec ISA does not provide any additional instructions to the PowerPC memory management model, but AltiVec instructions have options to ensure that an operand is correctly placed in a vector register or in memory.
Chapter 2
AltiVec Register Set

This chapter describes the register organization defined by AltiVec technology. It also describes how AltiVec instructions affect some of the registers in the PowerPC architecture. AltiVec instruction set architecture (ISA) defines register-to-register operations for all computational instructions. Source data for these instructions is accessed from the on-chip vector registers (VRs) or is provided as immediate values embedded in the opcode. Architecturally, the VRs are separate from the general-purpose registers (GPRs) and floating-point registers (FPRs). Data is transferred between memory and vector registers with explicit AltiVec load and store instructions only.

Note that the handling of reserved bits in any register is implementation-dependent. Software is permitted to write any value to a reserved bit in a register. However, a subsequent reading of the reserved bit returns 0 if the value last written to the bit was 0 and returns an undefined value (may be 0 or 1) otherwise. This means that even if the last value written to a reserved bit was 1, reading that bit may return 0.

2.1 Overview on the AltiVec and PowerPC Registers

AltiVec technology adds new registers and also affects bit settings in some of the PowerPC registers when AltiVec instructions are executed. Figure 2-1 shows a graphic representation of the entire PowerPC register set and how the AltiVec register set resides within the PowerPC architecture. The PowerPC registers affected by AltiVec instructions are shaded, and AltiVec registers are highlighted as well.

Note that a processor that implements the PowerPC architecture may have additional registers specific only to that processor.
Figure 2-1. Programming Model: All Registers

[Figure: the full PowerPC register set, with the AltiVec registers highlighted and the PowerPC registers used in the AltiVec technology shaded. Register widths in bits are given in parentheses.]

User model, UISA
  General-purpose registers: GPR0 to GPR31 (32)
  Floating-point registers: FPR0 to FPR31 (64)
  Condition register: CR (32)
  Floating-point status and control register: FPSCR (32)
  XER (32), SPR 1
  Link register: LR (32), SPR 8
  Count register: CTR (32), SPR 9

User model, VEA
  Time base facility (for reading): TBL (32), TBR 268; TBU (32), TBR 269

AltiVec registers [2]
  Vector registers: VR0 to VR31 (128)
  Vector status and control register: VSCR (32)
  AltiVec save register: VRSAVE (32), SPR 256

Supervisor model, OEA
  Configuration registers: MSR (32); PVR (32), SPR 287
  Memory management registers: IBAT0U to IBAT3L (32), SPR 528 to SPR 535; DBAT0U to DBAT3L (32), SPR 536 to SPR 543; SDR1 (32), SPR 25; segment registers SR0 to SR15 (32)
  Exception handling registers: DAR (32), SPR 19; DSISR (32), SPR 18; SPRG0 to SPRG3 (32), SPR 272 to SPR 275; SRR0 (32), SPR 26; SRR1 (32), SPR 27; FPECR [1] (32), SPR 1022
  Miscellaneous registers: time base facility (for writing) TBL (32), SPR 284, and TBU (32), SPR 285; DEC (32), SPR 22; DABR [1] (32), SPR 1013; EAR [1] (32), SPR 282; PIR [1] (32), SPR 1023

[1] These registers are defined as optional by the PowerPC architecture.
[2] These registers are defined by the AltiVec technology.
2.2 AltiVec Register Set Overview

The AltiVec registers, shown in Figure 2-2, can be accessed by user- or supervisor-level instructions. The vector registers (VRs) are accessed as instruction operands. Access to the registers can be explicit (that is, through instructions provided for that purpose, such as the move from vector status and control register (mfvscr) and move to vector status and control register (mtvscr) instructions) or implicit, as part of the execution of an instruction. The VRs are accessed both explicitly and implicitly.

The number to the right of a register name in the figure is the number used in the syntax of the instruction operands to access the register (for example, the number used to access VRSAVE is SPR 256).

Figure 2-2. AltiVec Register Set

  Vector registers: VR0, VR1, ..., VR31 (bits 0 to 127 each)
  Vector status and control register: VSCR (bits 0 to 31)
  Vector save/restore register: VRSAVE (bits 0 to 31), SPR 256

The user-level registers can be accessed by all software with either user or supervisor privileges. The user-level register set for AltiVec technology includes the following:

  Vector registers (VRs): The vector register file consists of 32 VRs designated VR0–VR31. The VRs serve as vector source and vector destination registers for all vector instructions. See Section 2.3.1, "AltiVec Vector Register File (VRF)," for more information.

  Vector status and control register (VSCR): The VSCR contains the non-Java and saturation bits; the remaining bits are reserved. See Section 2.3.2, "Vector Status and Control Register (VSCR)," for more details.

  Vector save/restore register (VRSAVE): The VRSAVE assists application and operating system software in saving and restoring the architectural state across context switches. Each bit in the VRSAVE indicates whether the corresponding vector register is live (1) or dead (0). See Section 2.3.3, "Vector Save/Restore Register (VRSAVE)," for more information.
2.3 Registers Defined by AltiVec ISA

The AltiVec ISA defines several registers. For the most part the new AltiVec registers interact only with AltiVec instructions. The exception is the VRSAVE register, which is read and written by the PowerPC instructions mfspr and mtspr, respectively.

2.3.1 AltiVec Vector Register File (VRF)

The VRF, shown in Figure 2-3, has 32 registers, each 128 bits wide. Each vector register can hold sixteen 8-bit elements, eight 16-bit elements, or four 32-bit elements.

Figure 2-3. Vector Registers (VRs)

[Figure: the 32 vector registers VR0–VR31, each spanning bits 0 to 127 and viewable as sixteen 8-bit elements, eight 16-bit elements, or four 32-bit elements.]

The vector registers are accessed as vector instruction operands. Access to the registers is explicit, as part of the execution of an AltiVec instruction.

2.3.2 Vector Status and Control Register (VSCR)

The vector status and control register (VSCR) is a 32-bit vector register (not an SPR) that is read and written in a manner similar to the FPSCR in the PowerPC scalar floating-point unit. The VSCR is shown in Figure 2-4.
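As a rough illustration of how the two defined VSCR bits behave, the following Python sketch models the register as a plain integer. It assumes the bit positions given in Table 2-1 (NJ is bit 15, SAT is bit 31, with bit 0 the most-significant bit) and the sticky SAT behavior described there. The function names are hypothetical helpers for this sketch, not AltiVec mnemonics, and the unsigned byte add merely stands in for any saturating instruction.

```python
# Illustrative model only; helper names are hypothetical.
VSCR_NJ = 1 << (31 - 15)    # bit 15 in PowerPC numbering (bit 0 is the MSB)
VSCR_SAT = 1 << (31 - 31)   # bit 31, the least-significant bit


def add_unsigned_bytes_saturating(a, b, vscr):
    """Element-wise unsigned 8-bit add that clamps at 255 and sets sticky SAT."""
    result = []
    for x, y in zip(a, b):
        total = x + y
        if total > 0xFF:
            total = 0xFF
            vscr |= VSCR_SAT   # sticky: nothing in this function ever clears it
        result.append(total)
    return result, vscr


def vscr_image_in_vector_register(vscr):
    """128-bit image after a move from VSCR: right-justified, upper 96 bits zero."""
    return vscr & 0xFFFFFFFF


vscr = 0
r1, vscr = add_unsigned_bytes_saturating([200] * 16, [100] * 16, vscr)  # saturates
r2, vscr = add_unsigned_bytes_saturating([1] * 16, [2] * 16, vscr)      # does not
image = vscr_image_in_vector_register(vscr)
vscr_cleared = vscr & ~VSCR_SAT   # the effect of writing SAT = 0 back to the VSCR
```

The second add does not saturate, yet SAT is still set afterwards. This is why the manual recommends accumulating results and reading the VSCR infrequently.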
The VSCR has two defined bits, the AltiVec non-Java mode (NJ) bit (VSCR[15]) and the AltiVec saturation (SAT) bit (VSCR[31]); the remaining bits are reserved. The special instructions move from vector status and control register (mfvscr) and move to vector status and control register (mtvscr) are provided to move the contents of the VSCR to and from a vector register. When moved to or from a vector register, the 32-bit VSCR is right-justified in the 128-bit vector register. When the VSCR is moved to a vector register, the upper 96 bits of the vector register (VRn[0–95]) are cleared, so the VSCR in a vector register looks as shown in Figure 2-5. VSCR bit settings are shown in Table 2-1.

Figure 2-4. Vector Status and Control Register (VSCR)

  Bits    0–14       15    16–30      31
  Field   Reserved   NJ    Reserved   SAT
  Reset   Implementation-specific
  R/W     R/W with mfvscr or mtvscr instruction

Figure 2-5. 32-Bit VSCR Moved to a 128-Bit Vector Register

  Bits    0–95       96–110     111   112–126    127
  Field   Reserved   Reserved   NJ    Reserved   SAT

Table 2-1. VSCR Field Descriptions

  Bits 0–14, reserved
    The handling of reserved bits is the same as that for other PowerPC registers. Software is permitted to write any value to such a bit. A subsequent read of the bit returns 0 if the value last written to the bit was 0 and returns an undefined value (0 or 1) otherwise.

  Bit 15, NJ
    Non-Java. This bit determines whether AltiVec floating-point operations are performed in a Java-IEEE-C9X-compliant mode or a possibly faster non-Java/non-IEEE mode.
    0  The Java-IEEE-C9X-compliant mode is selected. Denormalized values are handled as specified by Java, IEEE, and the C9X standard.
    1  The non-Java/non-IEEE-compliant mode is selected. If an element in a source vector register contains a denormalized value, the value 0 is used instead. If an instruction causes an underflow exception, the corresponding element in the target VR is cleared to 0. In both cases the 0 has the same sign as the denormalized or underflowing value.
    This mode is described in detail in the floating-point overview, Section 3.2.1, "Floating-Point Modes."
Table 2-1. VSCR Field Descriptions (continued)

  Bits 16–30, reserved
    The handling of reserved bits is the same as that for other PowerPC registers. Software is permitted to write any value to such a bit. A subsequent read of the bit returns 0 if the value last written to the bit was 0 and returns an undefined value (0 or 1) otherwise.

  Bit 31, SAT
    Saturation. A sticky status bit indicating that some field in a saturating instruction saturated since the last time SAT was cleared. In other words, when SAT = 1 it remains set to 1 until it is cleared to 0 by an mtvscr instruction. For further discussion refer to Section 4.2.1.1, "Saturation Detection."
    0  Indicates no saturation occurred; mtvscr can explicitly clear this bit.
    1  The saturation bit is set when saturation occurs in the result of one of the AltiVec instructions that have "saturate" in their names, or when it is written by mtvscr, as follows:
       Move to VSCR (mtvscr)
       Vector add integer with saturation (vaddubs, vadduhs, vadduws, vaddsbs, vaddshs, vaddsws)
       Vector subtract integer with saturation (vsububs, vsubuhs, vsubuws, vsubsbs, vsubshs, vsubsws)
       Vector multiply-add integer with saturation (vmhaddshs, vmhraddshs)
       Vector multiply-sum with saturation (vmsumuhs, vmsumshs)
       Vector sum-across with saturation (vsumsws, vsum2sws, vsum4sbs, vsum4shs, vsum4ubs)
       Vector pack with saturation (vpkuhus, vpkuwus, vpkshus, vpkswus, vpkshss, vpkswss)
       Vector convert to fixed-point with saturation (vctuxs, vctsxs)

The mtvscr instruction is context synchronizing. This implies that all AltiVec instructions logically preceding an mtvscr in the program flow execute in the architectural context (NJ mode) that existed before completion of the mtvscr, and that all instructions logically following the mtvscr execute in the new context (NJ mode) established by the mtvscr.

After an mfvscr instruction executes, the result in the target vector register is architecturally precise. That is, it reflects all updates to the SAT bit that could have been made by vector instructions logically preceding it in the program flow, and it does not reflect any SAT updates that may be made by vector instructions logically following it in the program flow. Because it is context synchronizing, mfvscr can be much slower than typical AltiVec instructions, and therefore care must be taken in reading the VSCR to avoid performance problems.

2.3.3 Vector Save/Restore Register (VRSAVE)

The VRSAVE register, shown in Figure 2-6, is a user-level 32-bit SPR used to assist application and operating system software in saving and restoring the architectural state across process context switches. The VRSAVE is SPR 256 and is entirely maintained and managed by software.
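Because VRSAVE is managed entirely by software, its use is a convention rather than hardware behavior. The Python sketch below shows one way the bookkeeping could look. It assumes the layout of Figure 2-6, in which bit n (PowerPC numbering, bit 0 most significant) corresponds to VRn. The function names are hypothetical.

```python
def vrsave_bit(n):
    """Mask for VRn: bit n in PowerPC numbering, where bit 0 is the MSB."""
    if not 0 <= n <= 31:
        raise ValueError("vector register number must be 0..31")
    return 1 << (31 - n)


def mark_live(vrsave, regs):
    """Return VRSAVE with the bits for the given vector registers set to 1."""
    for n in regs:
        vrsave |= vrsave_bit(n)
    return vrsave


def registers_to_save(vrsave):
    """Registers a context switch would need to save under this convention."""
    return [n for n in range(32) if vrsave & vrsave_bit(n)]


# A program that uses VR0, VR1, and VR31 would advertise them like this.
vrsave = mark_live(0, [0, 1, 31])
live = registers_to_save(vrsave)
```

A context-switch routine following this convention would save only the three registers in `live`, which is why a program that forgets to mark a register it uses risks losing its contents.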
Figure 2-6. Vector Save/Restore Register (VRSAVE)

  Bits    0     1     2     ...   15     16     17     ...   31
  Field   VR0   VR1   VR2   ...   VR15   VR16   VR17   ...   VR31
  Reset   0000_0000_0000_0000 (bits 0–15), 0000_0000_0000_0000 (bits 16–31)
  R/W     R/W with mfspr or mtspr instructions
  SPR     SPR 256

VRSAVE bit settings are shown in Table 2-2.

The VRSAVE register can be accessed only by the mfspr and mtspr instructions. Each bit in this register corresponds to a vector register (VR) and indicates whether the corresponding register contains data that is currently in use by the executing process. Therefore, when an exception occurs, the operating system needs to save and restore only those VRs that are marked as in use. If this approach is taken, it must be applied rigorously; if a program fails to indicate that a given VR is in use, software errors may occur that are difficult to detect and correct because they are timing-dependent. Some operating systems save and restore VRSAVE only for programs that also use other AltiVec registers.

Table 2-2. VRSAVE Bit Settings

  Bits 0–31, VRn
    Each bit in the VRSAVE register indicates whether the corresponding VR contains data in use by the executing process.
    0  VRn is not being used for the current process.
    1  VRn is being used for the current process.

2.4 Additions to PowerPC UISA Registers

The PowerPC UISA registers can be accessed by either user- or supervisor-level instructions. The one register affected by the AltiVec architecture is the condition register (CR). The CR is a 32-bit register, divided into eight 4-bit fields, CR0–CR7, that reflects the results of certain arithmetic operations and provides a mechanism for testing and branching. For more details refer to Chapter 2, "Register Set," in the Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture.
2.4.1 PowerPC Condition Register

The PowerPC condition register (CR) is a 32-bit register that reflects the result of certain operations and provides a mechanism for testing and branching. For the AltiVec ISA, the CR6 field can optionally be used: if the record bit (Rc) of a vector compare instruction is set, the CR6 field is updated. The CR is divided into eight 4-bit fields, CR0–CR7, as shown in Figure 2-7.

Figure 2-7. Condition Register (CR)

  Bits    0–3   4–7   8–11   12–15   16–19   20–23   24–27   28–31
  Field   CR0   CR1   CR2    CR3     CR4     CR5     CR6     CR7
  Reset   Implementation-specific
  R/W     R/W with mtcrf or mfcr instructions (CR6 can be the implicit result of vector compare instructions)

For more details on the CR see Chapter 2, "Register Set," in the Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture.

To control program flow based on vector data, all vector compare instructions can optionally update CR6. If the record bit (Rc) is set in a vector compare instruction, CR6 is updated according to Table 2-3.

The Rc bit should be used sparingly, because Rc = 1 can cause a somewhat longer latency or be more disruptive to instruction pipeline flow than Rc = 0. Techniques that accumulate results and test infrequently are therefore advised.

Table 2-3. CR6 Field Bit Settings for Vector Compare Instructions

  CR bit 24 (CR6 field bit 0)
    Vector compare:         1  Relation is true for all element pairs
    Vector compare bounds:  0

  CR bit 25 (CR6 field bit 1)
    Vector compare:         0
    Vector compare bounds:  0

  CR bit 26 (CR6 field bit 2)
    Vector compare:         1  Relation is false for all element pairs
    Vector compare bounds:  1  All fields are in bounds for the vcmpbfp instruction, so the result code of all fields is 0b00
                            0  One of the fields is out of bounds for the vcmpbfp instruction

  CR bit 27 (CR6 field bit 3)
    Vector compare:         0
    Vector compare bounds:  0
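The CR6 encoding in Table 2-3 can be sketched in a few lines of Python. This is a simplified model under stated assumptions: a generic element-wise equality compare stands in for the vector compare instructions, elements are 32 bits wide, and the function names are hypothetical. The placement of CR6 in CR bits 24–27 follows Figure 2-7.

```python
def compare_equal_with_record(a, b, element_bits=32):
    """Element-wise equality compare that also produces the 4-bit CR6 value."""
    ones = (1 << element_bits) - 1
    result = [ones if x == y else 0 for x, y in zip(a, b)]
    all_true = all(r == ones for r in result)
    all_false = all(r == 0 for r in result)
    # CR6 field bits 0..3 = [all true, 0, all false, 0]; field bit 0 is the MSB.
    cr6 = (int(all_true) << 3) | (int(all_false) << 1)
    return result, cr6


def insert_cr6(cr, cr6):
    """Place a 4-bit CR6 value into CR bits 24-27 of a 32-bit CR image."""
    shift = 31 - 27   # CR bit 27 is the low-order bit of the CR6 field
    return (cr & ~(0xF << shift) & 0xFFFFFFFF) | (cr6 << shift)


_, cr6_all_equal = compare_equal_with_record([1, 2, 3, 4], [1, 2, 3, 4])
_, cr6_none_equal = compare_equal_with_record([1, 2, 3, 4], [5, 6, 7, 8])
_, cr6_mixed = compare_equal_with_record([1, 2, 3, 4], [1, 0, 3, 0])
cr_image = insert_cr6(0xFFFFFFFF, cr6_all_equal)
```

A mixed result leaves both defined bits clear, so a branch on CR6 can distinguish "all", "none", and "some" with a single recorded compare.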
2.5 Additions to PowerPC OEA Registers

The PowerPC operating environment architecture (OEA) registers can be accessed only by supervisor-level instructions. Any attempt to access these SPRs with user-level instructions results in a supervisor-level exception. For more details on the MSR and SRRs see Chapter 2, "Register Set," in the Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture.

2.5.1 AltiVec Field Added in the PowerPC Machine State Register (MSR)

An AltiVec available field is added to the PowerPC machine state register (MSR). The MSR is 32 bits wide, as shown in Figure 2-8.

Figure 2-8. Machine State Register (MSR)

  Bits    0–5        6     7–12       13    14     15
  Field   Reserved   VEC   Reserved   POW   Res.   ILE

  Bits    16   17   18   19   20    21   22   23    24     25   26   27   28–29   30   31
  Field   EE   PR   FP   ME   FE0   SE   BE   FE1   Res.   IP   IR   DR   Res.    RI   LE

  Reset   Implementation-specific
  R/W     R with mfmsr; W with exception occurrence, mtmsr, sc, or rfi instructions

In 32-bit PowerPC implementations, bit 6, the VEC field, is added to the MSR as shown in Figure 2-8. In addition, AltiVec data stream prefetching is suspended and resumed based on MSR[PR] and MSR[DR]. The data stream touch (dst) and data stream touch for store (dstst) instructions are supported whenever MSR[DR] = 1. If either instruction is executed when MSR[DR] = 0 (real addressing mode), the results are boundedly undefined. For each existing data stream, prefetching is enabled if MSR[DR] = 1 and MSR[PR] has the value it had when the dst or dstst instruction that specified the data stream was executed. Otherwise prefetching for the data stream is suspended. In particular, the occurrence of an exception suspends all data stream prefetching.

Table 2-4 shows the AltiVec bit definitions for the MSR as well as how the PR and DR bits are affected by AltiVec data stream instructions.
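The rule for when a data stream keeps prefetching can be stated as a small predicate. The Python sketch below is only an illustration of that rule: the MSR masks follow the bit positions in Figure 2-8 (bit 0 most significant), while the `DataStream` class and the function names are hypothetical.

```python
from dataclasses import dataclass

# MSR bit masks derived from the bit numbers in Figure 2-8 (bit 0 is the MSB).
MSR_VEC = 1 << (31 - 6)
MSR_PR = 1 << (31 - 17)
MSR_DR = 1 << (31 - 27)


@dataclass
class DataStream:
    """Hypothetical record of one data stream started by dst or dstst."""
    pr_at_start: int   # value of MSR[PR] when the stream was specified


def prefetch_enabled(stream, msr):
    """True if prefetching for this stream is enabled under the current MSR."""
    dr = 1 if msr & MSR_DR else 0
    pr = 1 if msr & MSR_PR else 0
    return dr == 1 and pr == stream.pr_at_start


user_stream = DataStream(pr_at_start=1)
msr_user = MSR_VEC | MSR_PR | MSR_DR        # user mode, translation on
msr_supervisor = MSR_VEC | MSR_DR           # supervisor mode, translation on
msr_real_mode = MSR_VEC | MSR_PR            # translation off

running = prefetch_enabled(user_stream, msr_user)
suspended_by_mode = prefetch_enabled(user_stream, msr_supervisor)
suspended_by_dr = prefetch_enabled(user_stream, msr_real_mode)
```

Under this rule a stream started in user mode is suspended while the processor runs with a different MSR[PR] value or with data translation off, and becomes eligible again once both conditions match.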
Table 2-4. MSR Bit Settings

  Bit 6, VEC
    AltiVec available.
    0  AltiVec is disabled.
    1  AltiVec is enabled.
    Note: Any attempt to execute a non-stream AltiVec instruction when the bit is cleared causes the processor to take an AltiVec unavailable exception when the instruction accesses the VRF or the VSCR. This exception does not occur for the data streaming instructions (dst(t), dstst(t), and dss); that is, the VRF and VSCR are available to the data streaming instructions even when MSR[VEC] is cleared. The VRSAVE register is not protected by MSR[VEC]; that is, it can be accessed even when MSR[VEC] is cleared.

  Bit 17, PR
    Privilege level.
    0  The processor can execute both user- and supervisor-level instructions.
    1  The processor can execute only user-level instructions.
    Note: Care should be taken if data stream prefetching is used in supervisor mode (MSR[PR] = 0). For each existing data stream, prefetching is enabled if MSR[DR] = 1 and MSR[PR] has the value it had when the dst or dstst instruction that specified the data stream was executed. Otherwise prefetching for the data stream is suspended.

  Bit 27, DR
    Data address translation.
    0  Data address translation is disabled. If data stream touch (dst) or data stream touch for store (dstst) instructions are executed when DR = 0, the results are boundedly undefined.
    1  Data address translation is enabled. Data stream touch (dst) and data stream touch for store (dstst) instructions are supported whenever DR = 1.

For more detailed information, including the other bit settings for the MSR, refer to Chapter 2, "Register Set," in the Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture.

2.5.2 Machine Status Save/Restore Registers (SRRs)

The machine status save/restore registers (SRRs) are part of the PowerPC OEA supervisor-level registers. The SRR0 and SRR1 registers are used to save machine status on exceptions and to restore machine status when an rfi instruction is executed. For more detailed information, refer to Chapter 2, "Register Set," in the Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture.

2.5.2.1 Machine Status Save/Restore Register 0 (SRR0)

SRR0 is a 32-bit register in 32-bit implementations. SRR0 is used to save machine status on exceptions and to restore machine status when an rfi instruction is executed. For the AltiVec ISA, it holds the effective address (EA) of the instruction that caused the AltiVec unavailable exception. The AltiVec unavailable exception occurs when no higher-priority exception exists and an attempt is made to execute an AltiVec instruction when MSR[VEC] = 0. The format of SRR0 is shown in Figure 2-9.
Figure 2-9. Machine Status Save/Restore Register 0 (SRR0)

  Bits    0–31
  Field   Holds the effective address (EA) for the instruction in the interrupted program flow
  Reset   Implementation-specific
  R/W     R/W with rfi

2.5.2.2 Machine Status Save/Restore Register 1 (SRR1)

SRR1 is a 32-bit register in 32-bit implementations. SRR1 is used to save machine status on exceptions and to restore machine status when an rfi instruction is executed. The format of SRR1 is shown in Figure 2-10.

Figure 2-10. Machine Status Save/Restore Register 1 (SRR1)

  Bits    0–31
  Field   Exception-specific information and MSR bit values
  Reset   Implementation-specific
  R/W     R/W with rfi

When an AltiVec unavailable exception occurs, SRR1[1–4] and SRR1[10–15] are cleared and all other SRR1 bits are loaded from the MSR as it was just prior to the exception. That is, MSR[0], MSR[5–9], and MSR[16–31] are placed into the corresponding bit positions of SRR1 as they were before the exception was taken.
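The SRR1 rule above is a masking operation. The following Python sketch, with hypothetical helper names, builds the mask from the PowerPC bit ranges quoted in the text (bit 0 is the most-significant bit) and applies it to an MSR image.

```python
def ppc_bits(first, last):
    """Mask covering PowerPC-numbered bits first..last of a 32-bit register."""
    mask = 0
    for bit in range(first, last + 1):
        mask |= 1 << (31 - bit)
    return mask


# SRR1[1-4] and SRR1[10-15] are cleared on an AltiVec unavailable exception.
SRR1_CLEARED = ppc_bits(1, 4) | ppc_bits(10, 15)
# MSR[0], MSR[5-9], and MSR[16-31] are copied through unchanged.
SRR1_COPIED = ppc_bits(0, 0) | ppc_bits(5, 9) | ppc_bits(16, 31)


def srr1_for_altivec_unavailable(msr):
    """SRR1 image saved when the exception is taken, given the prior MSR."""
    return msr & SRR1_COPIED


saved = srr1_for_altivec_unavailable(0xFFFFFFFF)
```

The two masks are complements of each other, which confirms that the cleared and copied ranges listed in the text cover all 32 bits exactly once.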
Chapter 3
Operand Conventions

This chapter describes the operand conventions as they are represented in AltiVec technology at the user instruction set architecture (UISA) level. It gives detailed descriptions of the conventions used for transferring data between vector registers and memory and for representing data in the vector registers using both big- and little-endian byte ordering. Additionally, the floating-point default conditions for exceptions are described.

3.1 Data Organization in Memory

In addition to supporting byte, half-word, and word operands, as defined in the PowerPC architecture UISA, the AltiVec instruction set architecture (ISA) supports quad-word (128-bit) operands. The following sections describe the concepts of alignment and byte ordering of data for quad words; otherwise alignment is the same as described in Chapter 3, "Operand Conventions," in the Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture.

3.1.1 Aligned and Misaligned Accesses

Vectors are accessed from memory with instructions such as load vector indexed (lvx) and store vector indexed (stvx). The operand of a vector register to memory access instruction has a natural alignment boundary equal to the operand length. In other words, the natural address of an operand is an integral multiple of the operand length. A memory operand is said to be aligned if it is aligned at its natural boundary; otherwise it is misaligned. Each AltiVec instruction is a 4-byte word and is word-aligned like PowerPC instructions.
Operands for vector register to memory access instructions have the characteristics shown in Table 3-1.

The concept of alignment is also applied more generally to data in memory. For example, an 8-byte data item is said to be half-word-aligned if its address is a multiple of two. Stepping the effective address (EA) by 2 bytes (one half word) at a time, to EA + 2, EA + 4, and so on, then always lands on a half-word boundary.

It is important to understand that AltiVec memory operands are assumed to be aligned, and AltiVec memory accesses are performed as if the appropriate number of low-order bits of the specified effective address were zero. This assumption differs from PowerPC integer and floating-point memory access instructions, where alignment is not always assumed. So for the AltiVec ISA, the low-order bit of the effective address is ignored for half-word AltiVec memory access instructions, and the low-order four bits of the effective address are ignored for quad-word AltiVec memory access instructions. The effect is to load or store the memory operand of the specified length that contains the byte addressed by the effective address.

If a memory operand is misaligned, additional instructions must be used to correctly place the operand in a vector register or in memory. AltiVec technology provides instructions to shift and merge the contents of two vector registers. These instructions facilitate copying misaligned quad-word operands between memory and the vector registers.
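The "low-order bits are ignored" rule amounts to masking the effective address down to the operand's natural boundary. A minimal Python sketch follows, with hypothetical function names. The text states the rule explicitly for half-word and quad-word accesses; extending it to word accesses here is an assumption based on the phrase "appropriate number of low-order bits".

```python
def accessed_address(ea, operand_bytes):
    """Address actually used: low-order bits below the operand size act as zero."""
    if operand_bytes not in (1, 2, 4, 16):
        raise ValueError("operand length must be 1, 2, 4, or 16 bytes")
    return ea & ~(operand_bytes - 1)


def is_aligned(ea, operand_bytes):
    """True if ea is an integral multiple of the operand length."""
    return ea % operand_bytes == 0


quad = accessed_address(0x1237, 16)   # low four bits ignored
half = accessed_address(0x1237, 2)    # low bit ignored
byte = accessed_address(0x1237, 1)    # nothing ignored
```

A quad-word access to 0x1237 therefore transfers the 16 bytes at 0x1230 through 0x123F, the aligned block that contains the addressed byte, rather than raising an alignment exception.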
Table 3-1. Memory Operand Alignment

  Operand     Length            32-bit aligned address, bits 28–31 [1]
  Byte        8 bits (1 byte)   xxxx
  Half word   2 bytes           xxx0
  Word        4 bytes           xx00
  Quad word   16 bytes          0000

  [1] An x in an address bit position indicates that the bit can be 0 or 1, independent of the state of other bits in the address.

3.1.2 AltiVec Byte Ordering

For processors that implement the PowerPC architecture and AltiVec technology, the smallest addressable memory unit is the byte (8 bits), and scalars are composed of one or more sequential bytes. The AltiVec ISA supports both big- and little-endian byte ordering. The default byte ordering is big-endian. However, the code sequence used to switch from big- to little-endian mode may differ among processors.
The PowerPC architecture uses the little-endian mode (LE) bit of the machine state register (MSR) to specify byte ordering. A value of 0 specifies big-endian mode and a value of 1 specifies little-endian mode. For further details on byte ordering in the PowerPC architecture, refer to Chapter 3, "Operand Conventions," in the Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture.

The AltiVec ISA follows the endian support of the PowerPC architecture for elements up to double words, with additional support for quad words. In the AltiVec ISA, when a 64-bit scalar is moved from a register to memory, it occupies eight consecutive bytes in memory, and a decision must be made regarding the byte ordering in these eight addresses.

3.1.2.1 Big-Endian Byte Ordering

For big-endian scalars, the most-significant byte (MSB) is stored at the lowest (or starting) address while the least-significant byte (LSB) is stored at the highest (or ending) address. This is called big-endian because the big end of the scalar comes first in memory.

3.1.2.2 Little-Endian Byte Ordering

For little-endian scalars, the LSB is stored at the lowest (or starting) address while the MSB is stored at the highest (or ending) address. This is called little-endian because the little end of the scalar comes first in memory.

3.1.3 Quad Word Byte Ordering Example

The idea of big- and little-endian byte ordering is best illustrated by an example of a quad word such as 0x0011_2233_4455_6677_8899_AABB_CCDD_EEFF located in memory. This quad word is used throughout this section to demonstrate how the bytes that make up a quad word are mapped into memory.

The quad word (0x0011_2233_4455_6677_8899_AABB_CCDD_EEFF) is shown in big-endian mapping in Figure 3-1. A hexadecimal representation is used for address values and for the contents of each byte. The address is shown below each byte's contents.
The big-endian model addresses the quad word at address 0x00, which holds the MSB (0x00), and proceeds to address 0x0F, which holds the LSB (0xFF).

Figure 3-1. Big-Endian Mapping of a Quad Word

  Byte       0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
  Contents   00  11  22  33  44  55  66  77  88  99  AA  BB  CC  DD  EE  FF
  Address    00  01  02  03  04  05  06  07  08  09  0A  0B  0C  0D  0E  0F
             MSB                                                         LSB
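The two orderings defined in Sections 3.1.2.1 and 3.1.2.2 can be checked against the example quad word with Python's built-in integer-to-bytes conversion. This is only an illustration of the byte layouts and says nothing about how a processor implements them. The variable names are arbitrary.

```python
QUAD_WORD = 0x00112233445566778899AABBCCDDEEFF

# Index i of each bytes object is the byte stored at address 0x00 + i.
big_endian = QUAD_WORD.to_bytes(16, "big")
little_endian = QUAD_WORD.to_bytes(16, "little")

big_first, big_last = big_endian[0], big_endian[15]
little_first, little_last = little_endian[0], little_endian[15]
```

In the big-endian layout address 0x00 holds the MSB (0x00) and address 0x0F holds the LSB (0xFF); the little-endian layout is the same sequence of bytes in reverse order.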
Figure 3-2 shows the same quad word using little-endian mapping. In the little-endian model, the quad word's address 0x00 holds the LSB (0xFF), and the mapping proceeds to address 0x0F, which holds the MSB (0x00).

Figure 3-2. Little-Endian Mapping of a Quad Word

  Byte       0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
  Contents   FF  EE  DD  CC  BB  AA  99  88  77  66  55  44  33  22  11  00
  Address    00  01  02  03  04  05  06  07  08  09  0A  0B  0C  0D  0E  0F
             LSB                                                         MSB

Figure 3-2 shows the sequence of bytes laid out with addresses increasing from left to right. Programmers familiar with little-endian byte ordering may be more accustomed to viewing quad words laid out with addresses increasing from right to left, as shown in Figure 3-3. This allows the little-endian programmer to view each scalar in its natural byte order of MSB to LSB. This section uses both conventions, choosing whichever is easier to understand for the specific example.

Figure 3-3. Little-Endian Mapping of a Quad Word, Alternate View

  Byte       0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
  Contents   00  11  22  33  44  55  66  77  88  99  AA  BB  CC  DD  EE  FF
  Address    0F  0E  0D  0C  0B  0A  09  08  07  06  05  04  03  02  01  00
             MSB                                                         LSB

3.1.4 Aligned Scalars in Little-Endian Mode

The effective address (EA) calculation for the load and store instructions is described in Chapter 4, "Addressing Modes and Instruction Set Summary." For processors that implement the PowerPC architecture in little-endian mode, the effective address is modified before being used to access memory. In the PowerPC architecture, the three low-order bits of the effective address are exclusive-ORed (XOR) with a three-bit value that depends on the length of the operand (1, 2, 4, or 8 bytes), as shown in Table 3-2. This address modification is called munging.
motorola chapter 3. operand conventions 3-5 data organization in memory the munged physical address is passed to the cache or to main memory, and the speci?d width of the data is transferred (in big-endian order?hat is, msb at the lowest address, lsb at the highest address) between a gpr or fpr and the addressed memory locations (as modi?d). munging makes it appear to the processor that individual aligned scalars are stored as little-endian, when in fact they are stored in big-endian order but at different byte addresses within double words. only the address is modi?d, not the byte order. for further details on how to align scalars in little-endian mode see chapter 3, ?perand conventions,?in programming environments manual for 32-bit implementations of the powerpc architecture . the powerpc address munging is performed on double-word units. in the powerpc architecture, little-endian mode would have the double words of a quad word appear swapped. when the quad word in memory shown at the top of figure 3-4, loads from address 0x00, the bottom of figure 3-4 shows how it appears to the processor as it munges the address. note that double words are swapped. the byte element addressed by the quad words base address, 0x0f, contains 0x28, while its msb at address 0x00 contains 0x27. this is due to the powerpc munging being applied to offsets within double words; altivec isa requires a munge within quad words. table 3-2. effective address modi?ations data width (bytes) ea modi?ation 1 xor with 0b111 2 xor with 0b110 4 xor with 0b100 8 no change byte 0 123456789101112131415 quad word contents 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f address 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f lsb msb byte 0 123456789101112131415 quad word contents 27 26 25 24 23 22 21 20 2f 2e 2d 2c 2b 2a 29 28 address 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f figure 3-4. 
Quad Word Load with PowerPC Munged Little-Endian Applied
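The munging rule in Table 3-2 can be tried out in a few lines of Python. The following is only a minimal sketch, not a description of any hardware implementation: the function names (munge, le_mode_store, le_mode_load, munged_quad_word_view) are hypothetical, memory is modeled as a Python bytearray, and only aligned scalars are handled. It shows that XORing the address and then transferring bytes in big-endian order makes aligned scalars look little-endian, and it reproduces the processor's view of the quad word in Figure 3-4.

```python
# Hypothetical model of PowerPC little-endian address munging (Table 3-2).
# XOR mask applied to the effective address, keyed by operand width in bytes.
MUNGE_MASK = {1: 0b111, 2: 0b110, 4: 0b100, 8: 0b000}


def munge(ea, width):
    """Return the munged address for an aligned scalar of the given width."""
    if ea % width != 0:
        raise ValueError("this sketch models aligned scalars only")
    return ea ^ MUNGE_MASK[width]


def le_mode_store(memory, ea, width, value):
    """Munge the address, then write the bytes in big-endian order."""
    pa = munge(ea, width)
    memory[pa:pa + width] = value.to_bytes(width, "big")


def le_mode_load(memory, ea, width):
    """Munge the address, then read the bytes in big-endian order."""
    pa = munge(ea, width)
    return int.from_bytes(memory[pa:pa + width], "big")


def munged_quad_word_view(memory):
    """Bytes of an aligned quad word as seen through byte-wide munged loads."""
    return bytes(le_mode_load(memory, ea, 1) for ea in range(16))


# A word stored at EA 0 physically lands at offset 4 in big-endian order,
# yet a byte load from EA 0 returns the least-significant byte.
double_word = bytearray(8)
le_mode_store(double_word, 0, 4, 0x11223344)
print(double_word.hex())
print(hex(le_mode_load(double_word, 0, 1)))

# The quad word from the top of Figure 3-4, viewed through the munge.
quad = bytes(range(0x20, 0x30))
print(munged_quad_word_view(quad).hex())
```

The last line prints the byte sequence shown in the bottom half of Figure 3-4: each double word is byte-reversed in place, but the two double words are not exchanged, which is the discontinuity the following text addresses.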
To accommodate quad-word operands, the PowerPC architecture cannot simply be extended by munging an extra address bit. Doing so would break existing code or platforms, and processors that implement AltiVec technology could not be mixed with non-AltiVec processors. Instead, AltiVec processors implement a double-word swap when moving quad words between vector registers and memory. Figure 3-5 shows how this swapping could be implemented. The diagram represents the load path double-word swapping; the store path looks the same, except that the memory and internal boxes are reversed.

Figure 3-5. AltiVec Little-Endian Double-Word Swap

In the diagram, the numbers at the bottom of the byte boxes represent the offset address of that byte; the numbers at the top are the values of the bytes at that offset. The little-endian ordering is discontinuous because the PowerPC munging is performed only on double-word units. The purpose of the double-word swap within the AltiVec unit is to perform an additional swap that is not part of the PowerPC architecture. When MSR[LE] = 1, double words are swapped and the bytes appear in their expected ordering. When MSR[LE] = 0, no swapping occurs.

To summarize, in little-endian mode, the load vector element indexed instructions (lvebx, lvehx, and lvewx) and the store vector element indexed instructions (stvebx, stvehx, and stvewx) have the same 3-bit address munge applied to the memory address as is specified by the PowerPC architecture for integer and floating-point loads and stores. For the quad-word load vector indexed instructions (lvx and lvxl) and the store vector indexed instructions (stvx and stvxl), the two double words of the quad-word scalar data are munged and swapped as they are moved between the vector register and memory.
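The effect of the extra double-word swap can be checked with a small Python model. This is a sketch under stated assumptions: munged_view and altivec_quad_word_load are hypothetical names, memory is a 16-byte Python bytes object holding the quad word from Figure 3-4, and the model covers the load path only.

```python
def munged_view(memory):
    """Quad word as seen after the PowerPC 3-bit munge of byte addresses."""
    return bytes(memory[ea ^ 0b111] for ea in range(16))


def altivec_quad_word_load(memory, msr_le):
    """Model of the load path: munge plus double-word swap when MSR[LE] = 1."""
    if not msr_le:
        # Big-endian mode: no munging and no swapping.
        return bytes(memory[:16])
    view = munged_view(memory)
    # Exchange the two double words of the munged image.
    return view[8:] + view[:8]


quad = bytes(range(0x20, 0x30))
print(munged_view(quad).hex())                # memory image of Figure 3-5
print(altivec_quad_word_load(quad, 1).hex())  # internal image of Figure 3-5
print(altivec_quad_word_load(quad, 0).hex())  # unchanged when MSR[LE] = 0
```

With the swap applied, the internal image is the byte-reversed quad word, so the combined munge and swap behaves like an XOR of the byte offset with 0b1111, which is the quad-word munge that AltiVec ISA needs.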
[Figure 3-5 diagram, load path. Two multiplexers controlled by MSR[LE] (inputs 0 and 1) route the double words of the memory image to the internal image.]

Memory image
Contents  27 26 25 24 23 22 21 20 2F 2E 2D 2C 2B 2A 29 28
Address   00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F

Internal image
Contents  2F 2E 2D 2C 2B 2A 29 28 27 26 25 24 23 22 21 20
Address   00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F

3.1.5 Vector Register and Memory Access Alignment

When loading an aligned byte, half word, or word memory operand into a vector register, the element that receives the data is the element that would have received the data had the entire aligned quad word containing the memory operand addressed by the effective address been loaded. Similarly, when an element in a vector register is stored into an aligned memory operand, the element selected to be stored is the element that would have been stored into the memory operand addressed by the effective address had the entire
vector register been stored to the aligned quad word containing the memory operand addressed by the effective address. The position of the element in the target or source vector register depends on the endian mode, as described above. (Byte memory operands are always aligned.)

For aligned byte, half word, and word memory operands, if the corresponding element number is known when the program is written, the appropriate vector splat and vector permute instructions can be used to copy or replicate the data contained in the memory operand after loading the operand into a vector register. A vector splat instruction takes the contents of one element in a vector register and replicates them into each element of the destination vector register. A vector permute instruction selects bytes from the concatenation of the contents of two vectors. An example of this is given in detail in Section 3.1.6, "Quad-Word Data Alignment." Another method is to replicate the element across an entire vector register before storing it into an arbitrary aligned memory operand of the same length; the replication ensures that the correct data is stored regardless of the offset of the memory operand in its aligned quad word in memory.

Because vector loads and stores are size-aligned, application binary interfaces (ABIs) should specify, and programmers should take care, that data is aligned on quad-word boundaries for maximum performance.

3.1.6 Quad-Word Data Alignment

AltiVec ISA does not provide alignment exceptions for loading and storing data. When performing vector loads and stores, the effect is as if the low-order four bits of the address are 0x0, regardless of the actual effective address generated. Because vectors may often be misaligned due to the nature of the algorithm, AltiVec ISA provides support for post-alignment of quad-word loads and pre-alignment of quad-word stores.
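The address truncation and the element placement described in Sections 3.1.5 and 3.1.6 reduce to simple bit arithmetic. The Python sketch below is illustrative only: the helper names are hypothetical, the element numbering assumes big-endian mode (element 0 at the lowest offset), and splat models the general idea of a vector splat on a Python list rather than any specific instruction.

```python
def aligned_quad_word_address(ea):
    """Address actually used by a quad-word load or store: low four bits ignored."""
    return ea & ~0xF


def element_number(ea, size):
    """Element that receives an aligned operand of `size` bytes (1, 2, or 4).

    Assumes big-endian mode; low-order bits below the operand size are ignored.
    """
    if size not in (1, 2, 4):
        raise ValueError("element size must be 1, 2, or 4 bytes")
    return (ea & 0xF) // size


def splat(vector, index):
    """Replicate one element of a vector into every element of the result."""
    return [vector[index]] * len(vector)


print(hex(aligned_quad_word_address(0x1003)))  # quad word containing 0x1003
print(element_number(0x1008, 4))               # word element for EA 0x1008
print(splat([10, 20, 30, 40], element_number(0x1008, 4)))
```

Replicating the element before a store, as the text suggests, means the stored value is correct whichever element position the effective address selects.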
Note that in the following diagrams, the effect of the swapping described above is assumed, and the memory diagrams are shown with respect to the logical mapping of the data.

Figure 3-6 and Figure 3-7 show misaligned vectors in memory for both big- and little-endian ordering. The big-endian and little-endian examples assume that the desired vector begins at address 0x03. In the figures, HI denotes the high-order quad word, and LO denotes the low-order quad word.

Quad word HI (bytes 0 through 15)
Address   00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
Contents           20 21 22 23 24 25 26 27 28 29 2A 2B 2C
                   MSB

Quad word LO (bytes 16 through 31)
Address   10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
Contents  2D 2E 2F
                LSB
Figure 3-6. Misaligned Vector in Big-Endian Mode
Figure 3-6 and Figure 3-7 show how such misaligned data causes data to be split across aligned quad words; only aligned quad words are loaded or stored by AltiVec load/store instructions. To align this vector, a program must load both (aligned) quad words that contain a portion of the misaligned vector data and then execute a vector permute (vperm) instruction to align the result.

3.1.6.1 Accessing a Misaligned Quad Word in Big-Endian Mode

Figure 3-1 shows the big-endian alignment model. Using the example in Figure 3-8, vHI and vLO represent vector registers that contain the misaligned quad words containing the MSBs and LSBs, respectively, of the misaligned quad word; vD is the target vector register.

Figure 3-8. Big-Endian Quad Word Alignment

Alignment is performed by left-rotating the combined 32-byte quantity (vHI:vLO) by an amount determined by the address of the first byte of the desired data. This left-rotation is done by means of a vperm instruction whose control vector is generated by a load vector for shift left (lvsl) instruction after loading the most-significant quad word (MSQ) and least-significant quad word (LSQ) that contain the desired vector. The lvsl instruction uses the same address specification as the load vector indexed that loads the vHI component, which for big-endian ordering is the address of the desired vector.

The following instruction sequence extracts the quad word in big-endian mode:

    lvx    vHI,rA,rB       # load the MSQ
    lvsl   vP,rA,rB        # set the permute vector
    addi   rB,rB,16        # address of LSQ
    lvx    vLO,rA,rB       # load LSQ component
    vperm  vD,vHI,vLO,vP   # align the data

Quad word HI (bytes 31 through 16)
Address   1F 1E 1D 1C 1B 1A 19 18 17 16 15 14 13 12 11 10
Contents                                         2F 2E 2D
                                                 MSB

Quad word LO (bytes 15 through 0)
Address   0F 0E 0D 0C 0B 0A 09 08 07 06 05 04 03 02 01 00
Contents  2C 2B 2A 29 28 27 26 25 24 23 22 21 20
                                              LSB
Figure 3-7.
Misaligned Vector in Little-Endian Addressing Mode

[Figure 3-8 diagram: vHI holds the aligned quad word at offsets 0x00 through 0x0F and vLO the quad word at offsets 0x10 through 0x1F, each containing part of the bytes 20 through 2F; vD receives the aligned vector, bytes 20 through 2F.]
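The big-endian load sequence in Section 3.1.6.1 can be followed step by step in a small Python model. This is a sketch only: the functions lvx, lvsl, and vperm below are simplified software stand-ins named after the instructions (registers are modeled as lists of 16 byte values and memory as a bytearray), and load_misaligned_be is a hypothetical wrapper that mirrors the five-instruction sequence.

```python
def lvx(memory, ea):
    """Load the aligned quad word containing ea (low four address bits ignored)."""
    base = ea & ~0xF
    return list(memory[base:base + 16])


def lvsl(ea):
    """Permute control vector for a left shift by sh = ea & 0xF bytes."""
    sh = ea & 0xF
    return [sh + i for i in range(16)]


def vperm(va, vb, vc):
    """Select bytes from the 32-byte concatenation va:vb using the low 5 bits of vc."""
    combined = va + vb
    return [combined[index & 0x1F] for index in vc]


def load_misaligned_be(memory, ea):
    v_hi = lvx(memory, ea)           # load the MSQ
    v_p = lvsl(ea)                   # set the permute vector
    v_lo = lvx(memory, ea + 16)      # load LSQ component
    return vperm(v_hi, v_lo, v_p)    # align the data


# The vector of Figure 3-6: bytes 0x20..0x2F starting at address 0x03.
memory = bytearray(32)
memory[0x03:0x13] = bytes(range(0x20, 0x30))
print(bytes(load_misaligned_be(memory, 0x03)).hex())
```

Because lvsl depends only on the low four address bits, the same control vector can be reused for every quad word of a stream that shares the same misalignment.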
Note that when data streaming is used, the overhead of generating the alignment permute vector can be spread out, and the latency of the loads may be absorbed by using loop unrolling.

The process of storing a misaligned vector is essentially the reverse of that for loading, except that the code has a read-modify-write sequence. The logical algorithm is that the vector source must be right-shifted and split into two parts, each of which is merged (via a vector select (vsel) instruction) with the current contents of its MSQ and its LSQ and stored back using a store vector indexed (stvx) instruction. The load vector for shift right (lvsr) instruction is used to produce the permute control vector to be used for the right-shifting. Note that a single register can be used for the shifted contents if a right-rotate is done. The rotate is performed by specifying the source register for both components of the vector permute (vperm); that is, a shift of a double register with the same contents in both parts results in a rotate. In addition, the same permute control vector can be used on a sequence of ones and zeros to generate a mask for use by the vsel instruction to do the merging.

The complete code sequence for the store case is as follows:

    lvx       vHI,rA,rB          # load current MSQ for update
    lvsr      vP,rA,rB           # load the alignment vector
    addi      rB,rB,16           # address of LSQ
    lvx       vLO,rA,rB          # load the current LSQ data
    vspltisb  v1s,-1             # generate the select mask bits
    vspltisb  v0s,0
    vperm     vMask,v0s,v1s,vP   # right shift the select mask
    vperm     vSrc,vSrc,vSrc,vP  # right rotate the data
    vsel      vLO,vSrc,vLO,vMask # insert LSQ component
    vsel      vHI,vHI,vSrc,vMask # insert MSQ component
    stvx      vLO,rA,rB          # store LSQ
    addi      rB,rB,-16          # address of MSQ
    stvx      vHI,rA,rB          # store MSQ

When fetching a misaligned stream, the control vector need only be computed once.
Thus the cost of the extra aligned fetches at the ends of the stream is amortized over the whole stream. None of the data fetched internally to the stream is wasted, and each quad word is fetched only once. The average time spent per misaligned vector load in a long sequence approaches the latency of one lvx and one vperm instruction.
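The read-modify-write store sequence in Section 3.1.6.1 can be modeled the same way. The Python below is a sketch under stated assumptions: lvx, stvx, lvsr, vperm, and vsel are simplified software stand-ins for the instructions (vsel takes bits from its first source where the mask bit is 0 and from its second where it is 1), registers are lists of 16 byte values, and store_misaligned_be is a hypothetical wrapper.

```python
def lvx(memory, ea):
    base = ea & ~0xF
    return list(memory[base:base + 16])


def stvx(memory, vector, ea):
    base = ea & ~0xF
    memory[base:base + 16] = bytes(vector)


def lvsr(ea):
    """Permute control vector for a right shift by sh = ea & 0xF bytes."""
    sh = ea & 0xF
    return [16 - sh + i for i in range(16)]


def vperm(va, vb, vc):
    combined = va + vb
    return [combined[index & 0x1F] for index in vc]


def vsel(va, vb, mask):
    """Bitwise select: va where the mask bit is 0, vb where it is 1."""
    return [(a & ~m & 0xFF) | (b & m) for a, b, m in zip(va, vb, mask)]


def store_misaligned_be(memory, ea, v_src):
    v_hi = lvx(memory, ea)                 # load current MSQ for update
    v_p = lvsr(ea)                         # load the alignment vector
    v_lo = lvx(memory, ea + 16)            # load the current LSQ data
    v_1s, v_0s = [0xFF] * 16, [0x00] * 16  # select mask bits
    v_mask = vperm(v_0s, v_1s, v_p)        # right shift the select mask
    v_rot = vperm(v_src, v_src, v_p)       # right rotate the data
    v_lo = vsel(v_rot, v_lo, v_mask)       # insert LSQ component
    v_hi = vsel(v_hi, v_rot, v_mask)       # insert MSQ component
    stvx(memory, v_lo, ea + 16)            # store LSQ
    stvx(memory, v_hi, ea)                 # store MSQ


memory = bytearray(b"\xAA" * 32)
store_misaligned_be(memory, 0x03, list(range(0x20, 0x30)))
print(memory.hex())
```

The bytes outside the 16-byte target range keep their previous values, which is the reason the sequence loads both quad words before storing them.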
3.1.6.2 Accessing a Misaligned Quad Word in Little-Endian Mode

The instruction sequences used to access misaligned quad-word operands in little-endian mode are similar to those used in big-endian mode. The following instruction sequence can be used to load the misaligned quad word shown in Figure 3-7 into a vector register in little-endian mode. The load alignment case is shown in Figure 3-9. The vector registers vHI and vLO receive the MSQ and LSQ, respectively; vD is the target vector register. The lvsr instruction uses the same address specification as an lvx that loads vLO; in little-endian byte ordering this is the address of the desired misaligned quad word.

    lvx    vLO,rA,rB       # load the LSQ
    lvsr   vP,rA,rB        # set the permute vector
    addi   rB,rB,16        # address of MSQ
    lvx    vHI,rA,rB       # load MSQ component
    vperm  vD,vHI,vLO,vP   # align the data

Similarly, the following sequence of instructions stores the contents of register vD into a misaligned quad word in memory in little-endian mode.

    lvx       vLO,rA,rB          # load current LSQ for update
    lvsl      vP,rA,rB           # load the alignment vector
    addi      rB,rB,16           # address of MSQ
    lvx       vHI,rA,rB          # load the current MSQ data
    vspltisb  v1s,-1             # generate the select mask bits
    vspltisb  v0s,0
    vperm     vMask,v0s,v1s,vP   # left rotate the select mask
    vperm     vSrc,vSrc,vSrc,vP  # left rotate the data
    vsel      vHI,vHI,vSrc,vMask # insert MSQ component
    vsel      vLO,vSrc,vLO,vMask # insert LSQ component
    stvx      vHI,rA,rB          # store MSQ
    addi      rB,rB,-16          # address of LSQ
    stvx      vLO,rA,rB          # store LSQ
Figure 3-9. Little-Endian Alignment

3.1.6.3 Scalar Loads and Stores

No alignment is performed for scalar load or store instructions in AltiVec ISA. If a vector load or store address is not properly size-aligned, the suitable number of least-significant bits are ignored and a size-aligned transfer occurs instead. Data alignment must be performed explicitly after the data has been brought into the registers. No assistance is provided for aligning individual scalar elements that are not aligned on their natural boundary.

The placement of scalar data in a vector element depends upon its address. That is, the placement of the addressed scalar is the same as if a load vector indexed instruction had been performed, except that only the addressed scalar is accessed (for cache-inhibited space); the values in the other vector elements are boundedly undefined. Also, data in the specified scalar is the same as if a store vector indexed instruction had been performed, except that only the scalar addressed is affected. No instructions are provided to assist in aligning individual scalar elements that are not aligned on their natural size boundary.

When a program knows the location of a scalar, it can perform the correct vector splats and vector permutes to move data to where it is required. For example, if a scalar is to be used as a source for a vector multiply (that is, each element multiplied by the same value), the scalar must be splatted into a vector register. Likewise, a scalar stored to an arbitrary memory location must be splatted into a vector register, and that register must be specified as the source of the store. This guarantees that the data appears in all possible positions of that scalar size for the store.
[Figure 3-9 diagram (caption above): vHI holds the aligned quad word at offsets 0x1F through 0x10 and vLO the quad word at offsets 0x0F through 0x00, each containing part of the bytes 2F through 20; vD receives the aligned vector, bytes 2F through 20.]

3.1.6.4 Misaligned Scalar Loads and Stores

Although no direct support of misaligned scalars is provided, the load-aligning sequence for big-endian vectors described in Section 3.1.6.1, "Accessing a Misaligned Quad Word in Big-Endian Mode," can be used to position the scalar to the left vector element, which can then be used as the source for a splat. That is, the address of a scalar is also the address of the left-most element of the quad word at that address. Similarly, the read-modify-write sequences, with the mask adjusted for the scalar size, can be used to store misaligned scalars.

The same is true for little-endian mode: the load-aligning sequence for little-endian vectors described in Section 3.1.6.2, "Accessing a Misaligned Quad Word in Little-Endian Mode," can be used to position the scalar to the right vector element, which can then be used
as the source for a splat. That is, the address of a scalar is also the address of the right-most element of the quad word at that address.

Note that while these sequences work in cache-inhibited space, the physical accesses are not guaranteed to be atomic.

3.1.7 Mixed-Endian Systems

In many systems, the memory model is not as simple as the examples in this chapter. In particular, big-endian systems with subordinate little-endian buses (such as PCI) comprise a mixed-endian environment. The basic mechanism to handle this is to use the vector permute (vperm) instruction to swap bytes within data elements. The value of the permute control vector depends on the size of the elements (8, 16, or 32 bits). That is, the permute control vector performs a parallel equivalent of the load word byte-reverse indexed (lwbrx) PowerPC instruction within the vector registers.

The hardest case occurs when there are misaligned, mixed-endian vectors. This can be handled by applying a vector permute of the data as required for the misaligned case, followed by the swapping vector permute on that result. Note that for streaming cases, the effect of this double permute can be accomplished by computing the swapping permute of the alignment permute vector and then applying the resulting permute control vector to incoming data.

3.2 AltiVec Floating-Point Instructions-UISA

There are two kinds of floating-point instructions defined for the PowerPC ISA and AltiVec ISA:

• Computational
• Noncomputational

Computational instructions are those defined by the IEEE 754 standard for 32-bit arithmetic (those that perform addition, subtraction, multiplication, and division), plus the multiply-add defined by the architecture. Noncomputational floating-point instructions consist of the floating-point load and store instructions. Only the computational instructions are considered floating-point operations throughout this chapter.
The single-precision format, value representations, and computational model defined in Chapter 3, "Operand Conventions," in the Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture, apply to AltiVec floating-point except as follows:
• In general, no status bits are set to reflect the results of floating-point operations. The only exception is that VSCR[SAT] may be set by the vector convert to fixed-point word instructions.
• With the exception of the two vector convert to fixed-point word (vctuxs, vctsxs) instructions and three of the four vector round to floating-point integer (vrfiz, vrfip, vrfim) instructions, all AltiVec floating-point instructions that round use the round-to-nearest rounding mode.
• Floating-point exceptions cannot cause the system error handler to be invoked.

If a function is required that is specified by the IEEE standard, is not supported by AltiVec ISA, and cannot be emulated satisfactorily using the functions that are supported by AltiVec ISA, the functions provided by the floating-point processor should be used; see Chapter 4, "Addressing Modes and Instruction Set Summary," in Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture.

3.2.1 Floating-Point Modes

AltiVec ISA supports two floating-point modes of operation: a Java mode, and a non-Java mode that is useful in circumstances where real-time performance is more important than strict Java and IEEE-standard compliance.

When VSCR[NJ] is 0 (default), operations are performed in Java mode. When VSCR[NJ] is 1, operations are carried out in non-Java mode.

3.2.1.1 Java Mode

Java compliance requires compliance with only a subset of the Java/IEEE/C9X standard. The Java subset helps simplify floating-point implementations, as follows:

• Reducing the number of operations that must be supported
• Eliminating exception status flags and traps
• Producing results corresponding to all disabled exceptions, thus eliminating enabling control flags
• Requiring only round-to-nearest rounding mode, which eliminates directed rounding modes and the associated rounding control flags
Java compliance requires the following aspects of the IEEE standard:

• Supporting denorms as inputs and results (gradual underflow) for arithmetic operations
• Providing NaN results for invalid operations
• NaNs compare unordered with respect to everything, so that the result of any comparison of any NaN to any data type is always false
In some implementations, floating-point operations in Java mode may have somewhat longer latency on normal operands and possibly much longer latency on denormalized operands than operations in non-Java mode. This means that in Java mode overall real-time response may be somewhat worse and deadline scheduling may be subject to much larger variance than in non-Java mode.

3.2.1.2 Non-Java Mode

In the non-Java/non-IEEE/non-C9X mode (VSCR[NJ] = 1), gradual underflow is not performed. Instead, any instruction that would have produced a denormalized result in Java mode substitutes a correctly signed zero (±0.0) as the final result. Also, denormalized input operands are flushed to the correctly signed zero (±0.0) before being used by the instruction.

The intent of this mode is to give programmers a way to assure optimum, data-insensitive, real-time response across implementations. Another way to improve response time would be to implement denormalized operations through software emulation.

It is architecturally permitted, but strongly discouraged, for an implementation to implement only non-Java mode. In such an implementation, VSCR[NJ] does not respond to attempts to clear it and is always read back as 1.

No other architecturally visible, implementation-specific deviations from this specification are permitted in either mode.

3.2.2 Floating-Point Infinities

Valid operations on infinities are processed according to the IEEE standard.

3.2.3 Floating-Point Rounding

All AltiVec floating-point arithmetic instructions use the IEEE default rounding mode, round-to-nearest. The IEEE directed rounding modes are not provided.

3.2.4 Floating-Point Exceptions

The following floating-point exceptions may occur during execution of AltiVec floating-point instructions:
• NaN operand exception
• Invalid operation exception
• Zero divide exception
• Log of zero exception
• Overflow exception
• Underflow exception
If an exception occurs, a result is placed into the corresponding target element as described in the following subsections. This result is the default result specified by Java, the IEEE standard, or C9X, as applicable.

Recall that denormalized source values are treated as if they were zero when VSCR[NJ] = 1. The consequences regarding exceptions are as follows:

• Exceptions that can be caused by a zero source value can be caused by a denormalized source value when VSCR[NJ] = 1.
• Exceptions that can be caused by a nonzero source value cannot be caused by a denormalized source value when VSCR[NJ] = 1.

3.2.4.1 NaN Operand Exception

If the exponent of a floating-point number is 255 and the fraction is nonzero, then the value is a NaN. If the most-significant bit of the fraction field of a NaN is zero, then the value is a signaling NaN (SNaN); otherwise it is a quiet NaN (QNaN). In all cases the sign of a NaN is irrelevant.

A NaN operand exception occurs when a source value for any of the following instructions is a NaN:

• An AltiVec instruction that would normally produce floating-point results
• Either of the two instructions vector convert to unsigned fixed-point word saturate (vctuxs) or vector convert to signed fixed-point word saturate (vctsxs)
• Any of the four vector floating-point compare instructions

The following actions are taken:

• If the AltiVec instruction would normally produce floating-point results, the corresponding result is a source NaN selected as follows. In all cases, if the selected source NaN is an SNaN, it is converted to the corresponding QNaN (by setting the high-order bit of the fraction field to 1) before being placed into the target element.
    if the element in register vA is a NaN
        then the result is that NaN
    else if the element in register vB is a NaN
        then the result is that NaN
    else if the element in register vC is a NaN
        then the result is that NaN

• If the instruction is either of the two vector convert to fixed-point word instructions (vctuxs, vctsxs), the corresponding result is 0x0000_0000. VSCR[SAT] is not affected.
• If the instruction is vector compare bounds floating-point (vcmpbfp[.]), the corresponding result is 0xC000_0000.
• If the instruction is one of the other three vector floating-point compare instructions (vcmpeqfp[.], vcmpgefp[.], vcmpgtfp[.]), the corresponding result is 0x0000_0000.

3.2.4.2 Invalid Operation Exception

An invalid operation exception occurs when a source value is invalid for the specified operation. The invalid operations are as follows:

• Magnitude subtraction of infinities
• Multiplication of infinity by zero
• Vector reciprocal square root estimate floating-point (vrsqrtefp) of a negative, nonzero number or -∞
• Log base 2 estimate (vlogefp) of a negative, nonzero number or -∞

The corresponding result is the QNaN 0x7FC0_0000. This is the single-precision analog of the double-precision generated QNaN described in Chapter 3, "Operand Conventions," in Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture.

3.2.4.3 Zero Divide Exception

A zero divide exception occurs when a vector reciprocal estimate floating-point (vrefp) or vector reciprocal square root estimate floating-point (vrsqrtefp) instruction is executed with a source value of zero. The corresponding result is infinity, where the sign is the sign of the source value, as follows:

    1/+0.0        →  +∞
    1/-0.0        →  -∞
    1/√(+0.0)     →  +∞
    1/√(-0.0)     →  -∞

3.2.4.4 Log of Zero Exception

A log of zero exception occurs when a vector log base 2 estimate floating-point instruction (vlogefp) is executed with a source value of zero. The corresponding result is infinity. The exception cases are as follows:

    vlogefp  log2(±0.0)  →  -∞
    vlogefp  log2(-x)    →  QNaN, where x ≠ 0
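The default results for the estimate instructions in Sections 3.2.4.2 through 3.2.4.4 can be tabulated in a short Python sketch operating on 32-bit single-precision bit patterns. The function names are hypothetical, each function returns None when none of these exceptions applies (the actual estimate is not modeled), NaN sources are left to the NaN operand rules, and the nj argument stands for VSCR[NJ] = 1, in which case a denormalized source is first flushed to a signed zero.

```python
SIGN = 0x80000000
POS_INF = 0x7F800000
NEG_INF = 0xFF800000
DEFAULT_QNAN = 0x7FC00000


def _flush(bits, nj):
    """In non-Java mode, a denormalized (or zero) source becomes a signed zero."""
    if nj and ((bits >> 23) & 0xFF) == 0:
        return bits & SIGN
    return bits


def _is_nan(bits):
    return (bits & 0x7FFFFFFF) > POS_INF


def _is_zero(bits):
    return (bits & 0x7FFFFFFF) == 0


def _is_negative_nonzero(bits):
    return bool(bits & SIGN) and not _is_zero(bits)


def vrefp_default(bits, nj=False):
    bits = _flush(bits, nj)
    if _is_zero(bits):
        return POS_INF | (bits & SIGN)   # zero divide: signed infinity
    return None


def vrsqrtefp_default(bits, nj=False):
    bits = _flush(bits, nj)
    if _is_nan(bits):
        return None
    if _is_zero(bits):
        return POS_INF | (bits & SIGN)   # zero divide: signed infinity
    if _is_negative_nonzero(bits):
        return DEFAULT_QNAN              # invalid operation
    return None


def vlogefp_default(bits, nj=False):
    bits = _flush(bits, nj)
    if _is_nan(bits):
        return None
    if _is_zero(bits):
        return NEG_INF                   # log of zero
    if _is_negative_nonzero(bits):
        return DEFAULT_QNAN              # invalid operation
    return None


print(hex(vrefp_default(0x80000000)))            # 1/-0.0
print(vrefp_default(0x00000001))                 # denormal, Java mode
print(hex(vrefp_default(0x00000001, nj=True)))   # denormal, non-Java mode
print(hex(vlogefp_default(0xBF800000)))          # log2(-1.0)
```

The second and third calls illustrate the earlier remark that an exception caused by a zero source value can also be caused by a denormalized source value when VSCR[NJ] = 1.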
3.2.4.5 Overflow Exception

An overflow exception happens when either of the following conditions occurs:

• For an AltiVec instruction that would normally produce floating-point results, the magnitude of what would have been the result if the exponent range were unbounded exceeds that of the largest finite single-precision number.
• For either of the two vector convert to fixed-point word instructions (vctuxs, vctsxs), either a source value is an infinity or the product of a source value and 2^UIMM, where UIMM is the unsigned immediate value, is a number too large to be represented in the target integer format.

The following actions are taken:

• If the AltiVec instruction would normally produce floating-point results, the corresponding result is infinity, where the sign is the sign of the intermediate result.
• If the instruction is vector convert to unsigned fixed-point word saturate (vctuxs), the corresponding result is 0xFFFF_FFFF if the source value is a positive number or +∞, and is 0x0000_0000 if the source value is a negative number or -∞. VSCR[SAT] is set.
• If the instruction is vector convert to signed fixed-point word saturate (vctsxs), the corresponding result is 0x7FFF_FFFF if the source value is a positive number or +∞, and is 0x8000_0000 if the source value is a negative number or -∞. VSCR[SAT] is set.

3.2.4.6 Underflow Exception

Underflow exceptions occur only for AltiVec instructions that would normally produce floating-point results. Underflow is detected before rounding. Underflow occurs when a nonzero intermediate result, computed as though both the precision and the exponent range were unbounded, is less in magnitude than the smallest normalized single-precision number (2^-126).

The following actions are taken:

• If VSCR[NJ] = 0, the corresponding result is the value produced by denormalizing and rounding the intermediate result.
• If VSCR[NJ] = 1, the corresponding result is a zero, where the sign is the sign of the intermediate result.

3.2.5 Floating-Point NaNs

The AltiVec floating-point data format is compliant with the Java/IEEE/C9X single-precision format. A quantity in this format can represent a signed normalized number, a signed denormalized number, a signed zero, a signed infinity, a quiet not a number (QNaN), or a signaling NaN (SNaN).
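The six kinds of value listed above can be told apart from the sign, exponent, and fraction fields of the 32-bit pattern. The following Python sketch is illustrative only; classify and flush_denormal are hypothetical names, and flush_denormal models the non-Java treatment of denormalized inputs from Section 3.2.1.2.

```python
import struct


def classify(bits):
    """Classify a 32-bit single-precision bit pattern."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x007FFFFF
    if exponent == 255:
        if fraction == 0:
            return "infinity"
        # The most-significant fraction bit separates quiet from signaling NaNs.
        return "qnan" if fraction & 0x00400000 else "snan"
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    return "normalized"


def flush_denormal(bits):
    """Non-Java mode input handling: replace a denormal with a signed zero."""
    if classify(bits) == "denormalized":
        return bits & 0x80000000
    return bits


def float_to_bits(value):
    """Bit pattern of a Python float rounded to single precision."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


for pattern in (0x3F800000, 0x00000001, 0x80000000,
                0xFF800000, 0x7FC00000, 0x7F800001):
    print(f"0x{pattern:08X}", classify(pattern))
print(hex(float_to_bits(1.0)), hex(flush_denormal(0x80000001)))
```

The sign bit plays no part in the classification, which matches the statement in Section 3.2.4.1 that the sign of a NaN is irrelevant.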
3.2.5.1 NaN Precedence

Whenever only one source operand of an instruction that returns a floating-point result is a NaN, that NaN is selected as the input NaN to the instruction. When more than one source operand is a NaN, the precedence order for selecting the NaN is first from vA, then from vB, and then from vC. If the selected NaN is an SNaN, it is processed as described in Section 3.2.5.2, "SNaN Arithmetic." QNaNs are processed according to Section 3.2.5.3, "QNaN Arithmetic."

3.2.5.2 SNaN Arithmetic

Whenever the input NaN to an instruction is an SNaN, a QNaN is delivered as the result, as specified by the IEEE standard when no trap occurs. The delivered QNaN is an exact copy of the original SNaN except that it is quieted; that is, the most-significant bit (MSB) of the fraction is a one.

3.2.5.3 QNaN Arithmetic

Whenever the input NaN to an instruction is a QNaN, it is propagated as the result according to the IEEE standard. All information in the QNaN is preserved through all arithmetic operations.

3.2.5.4 NaN Conversion to Integer

All NaNs convert to zero in conversion-to-integer instructions such as vctuxs and vctsxs.

3.2.5.5 NaN Production

Whenever the result of an AltiVec operation is a NaN (for example, an invalid operation), the NaN produced is a QNaN with the sign bit = 0, exponent field = 255, MSB of the fraction field = 1, and all other bits = 0.
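The precedence and quieting rules in Sections 3.2.5.1 through 3.2.5.5 fit in a few lines of Python. This is a sketch with hypothetical names (is_nan, quiet, select_nan); operands are 32-bit single-precision bit patterns for one element of vA, vB, and optionally vC, and the function returns None when no operand is a NaN.

```python
QUIET_BIT = 0x00400000       # most-significant bit of the fraction field
GENERATED_QNAN = 0x7FC00000  # sign 0, exponent 255, fraction MSB 1, rest 0


def is_nan(bits):
    """Exponent field is 255 and the fraction field is nonzero."""
    return (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0


def quiet(bits):
    """Quiet an SNaN by setting the fraction MSB; a QNaN is unchanged."""
    return bits | QUIET_BIT


def select_nan(a, b, c=None):
    """Return the result NaN using the vA, vB, vC precedence order."""
    for operand in (a, b, c):
        if operand is not None and is_nan(operand):
            return quiet(operand)
    return None


one, two = 0x3F800000, 0x40000000
print(hex(select_nan(one, 0x7F800001)))         # SNaN in vB is quieted
print(hex(select_nan(0xFFC01234, 0x7F800001)))  # NaN in vA takes precedence
print(select_nan(one, two))                     # no NaN operand
```

Apart from the quiet bit, the payload and sign of the selected NaN pass through unchanged, whereas a NaN produced by an invalid operation is always the fixed pattern held in GENERATED_QNAN.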
Chapter 4
Addressing Modes and Instruction Set Summary

This chapter describes instructions and addressing modes defined by the AltiVec instruction set architecture (ISA), organized according to the levels used by the PowerPC architecture: user instruction set architecture (UISA) and virtual environment architecture (VEA). AltiVec instructions are primarily UISA; those that are not are noted in the chapter. These instructions are divided into the following categories:

• Vector integer arithmetic instructions. These include arithmetic, logical, compare, rotate, and shift instructions, described in Section 4.2.1, "Vector Integer Instructions."
• Vector floating-point arithmetic instructions. These include floating-point arithmetic instructions as well as a discussion of floating-point modes, described in Section 4.2.2, "Vector Floating-Point Instructions."
• Vector load and store instructions. These include load and store instructions for vector registers, described in Section 4.2.3, "Load and Store Instructions."
• Vector permutation and formatting instructions. These include pack, unpack, merge, splat, permute, select, and shift instructions, described in Section 4.2.5, "Vector Permutation and Formatting Instructions."
• Processor control instructions. These instructions are used to read and write the AltiVec status and control register, described in Section 4.2.6, "Processor Control Instructions-UISA."
• Memory control instructions. These instructions are used for managing caches (user level and supervisor level), described in Section 4.3.1, "Memory Control Instructions-VEA."

This grouping of instructions does not necessarily indicate the execution unit that processes a particular instruction or group of instructions within a processor implementation.

AltiVec integer instructions operate on byte, half-word, and word operands. Floating-point instructions operate on single-precision operands. AltiVec ISA uses word-length instructions that are word-aligned.
it provides for byte, half-word, and word operand fetches and stores between memory and the vector registers (vrs). u v f r e e s c a l e s e m i c o n d u c t o r , i freescale semiconductor, inc. f o r m o r e i n f o r m a t i o n o n t h i s p r o d u c t , g o t o : w w w . f r e e s c a l e . c o m n c . . .
arithmetic and logical instructions do not read or modify memory. to use the contents of a memory location in a computation for an arithmetic or logical instruction, the following steps are taken:

1. the memory contents must be loaded into a register with a load instruction.
2. the contents are then modified.
3. the modified contents are written to the target location using a store instruction.

4.1 conventions

this section describes conventions used for the altivec instruction set. descriptions of memory addressing, synchronization, and the altivec exception summary follow.

4.1.1 execution model

when used with powerpc instructions, altivec instructions can be viewed as simply new powerpc instructions that are freely intermixed with existing ones to provide additional functionality. processors that implement the powerpc architecture appear to execute instructions in program order. some altivec implementations may not allow out-of-order execution and completion. non-data-dependent vector instructions may issue and execute while longer-latency instructions issued previously are still in the execute stage. register renaming avoids stalling dispatch on false dependencies and allows maximum register name reuse in heavily unrolled loops. the execution of a sequence of instructions will not be interrupted by exceptions, since the unit does not report ieee exceptions but rather produces the default results as specified in the java/ieee/c9x standards. the execution of a sequence of instructions may be interrupted only by a vector load or store instruction; otherwise, altivec instructions do not generate any exceptions.

4.1.2 computation modes

altivec isa supports the powerpc isa. the altivec isa supports the 32-bit implementation of the powerpc architecture in that all registers except fprs and vrs are 32 bits long and the effective addresses are 32 bits long.
this chapter describes only the instructions defined for 32-bit implementations of the powerpc architecture.

4.1.3 classes of instructions

altivec instructions follow the illegal instruction class defined by the powerpc architecture in the section "classes of instructions," in chapter 4, "addressing modes and instruction set summary," of the programming environments manual for 32-bit implementations of the powerpc architecture. for altivec isa, all unspecified encodings within the major opcode (04) that are not defined are illegal powerpc instructions. the only exclusion in defining an unspecified encoding is an unused bit in an immediate field or specifier field (///).

4.1.4 memory addressing

a program references memory using the effective (logical) address computed by the processor when it executes a load, store, or cache instruction, and when it fetches the next sequential instruction.

4.1.4.1 memory operands

bytes in memory are numbered consecutively starting with zero. each number is the address of the corresponding byte.

memory operands may be bytes, half words, words, or quad words for altivec instructions. the address of a memory operand is the address of its first byte (that is, of its lowest-numbered byte). operand length is implicit for each instruction. altivec isa supports both big-endian and little-endian byte ordering. the default byte and bit ordering is big-endian; see section 3.1.2, "altivec byte ordering," for more information.

the natural alignment boundary of an operand of a single-register memory access instruction is equal to the operand length. in other words, the natural address of an operand is an integral multiple of the operand length. a memory operand is said to be aligned if it is aligned at its natural boundary; otherwise it is misaligned. for a detailed discussion about memory operands, see section 3.1, "data organization in memory."

4.1.4.2 effective address calculation

an effective address (ea) is the 32-bit sum computed by the processor when executing a memory access or when fetching the next sequential instruction.
for a memory access instruction, if the sum of the ea and the operand length exceeds the maximum ea, the memory operand is considered to wrap around from the maximum ea through ea 0, as described in chapter 4, "addressing modes and instruction set summary," in the programming environments manual for 32-bit implementations of the powerpc architecture.

a zero in the ra field indicates the absence of the corresponding address component. for the absent component, a value of zero is used for the address. this is shown in the instruction description as (ra|0).

in all implementations of processors that support the powerpc architecture, the processor can modify the three low-order bits of the calculated effective address before accessing memory if the system is operating in little-endian mode. the double words of a quad word may be swapped as well. see section 3.1.2, "altivec byte ordering," for more information about little-endian mode.
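the two rules above, the (ra|0) convention and the wrap-around from the maximum ea through ea 0, can be sketched in python as follows. this is only an illustrative model for a 32-bit implementation; the function names and the list standing in for the general-purpose registers are hypothetical and are not part of the altivec isa.

```python
EA_MASK = 0xFFFF_FFFF  # effective addresses are 32 bits long


def effective_address(gpr, ra, rb):
    """Model EA = (rA|0) + (rB), truncated to 32 bits.

    gpr is a hypothetical list of 32 register values; ra and rb are the
    register numbers encoded in the instruction.
    """
    base = 0 if ra == 0 else gpr[ra]  # an rA field of 0 means "no base"
    return (base + gpr[rb]) & EA_MASK


def operand_addresses(ea, length):
    """Byte addresses covered by an operand, wrapping past the maximum EA."""
    return [(ea + i) & EA_MASK for i in range(length)]
```

note that with ra = 0 the contents of register 0 are ignored, so the index register alone supplies the address.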
altivec load and store operations use register indirect with index mode and boundary align to generate effective addresses. for further details see section 4.2.3.2, "load and store address generation."

4.2 altivec uisa instructions

altivec instructions can provide additional supporting instructions to the powerpc architecture. this section discusses the instructions defined in the altivec user instruction set architecture (uisa).

4.2.1 vector integer instructions

the following are categories for vector integer instructions:

- arithmetic
- compare
- logical
- rotate and shift

integer instructions use the content of the vector registers (vrs) as source operands and place results into vrs as well. setting the rc bit of a vector compare instruction causes the powerpc condition register (cr) to be updated.

altivec integer instructions treat source operands as signed integers unless the instruction is explicitly identified as performing an unsigned operation. for example, the vector add unsigned word modulo (vadduwm) and vector multiply odd unsigned byte (vmuloub) instructions interpret both operands as unsigned integers.

4.2.1.1 saturation detection

most integer instructions have both signed and unsigned versions, and many have both modulo (wrap-around) and saturating clamping modes. saturation occurs whenever the result of a saturating instruction does not fit in the result field. unsigned saturation clamps results to zero on underflow and to the maximum positive integer value (2^n - 1, for example, 255 for byte fields) on overflow. signed saturation clamps results to the smallest representable negative number (-2^(n-1), for example, -128 for byte fields) on underflow, and to the largest representable positive number (2^(n-1) - 1, for example, +127 for byte fields) on overflow. here n is the element length in bits.

when a modulo instruction is used, the result is truncated on overflow or underflow to the length (byte, half word, word, quad word) and type (unsigned, signed) of the operand.

the altivec isa provides a way to detect saturation: when a saturating instruction saturates, it sets the sat bit in the vector status and control register (vscr[sat]).
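as a rough illustration of the difference between the two modes, the following python sketch models a single element of width `bits`. the function names are hypothetical; the boolean returned by the saturating versions stands in for the condition under which vscr[sat] would be set, and is an assumption of this model rather than a description of any hardware interface.

```python
def add_modulo(a, b, bits):
    """Wrap-around add: keep only the low-order `bits` bits of the sum."""
    return (a + b) & ((1 << bits) - 1)


def add_unsigned_saturate(a, b, bits):
    """Clamp the sum to 2^n - 1; also report whether clamping occurred."""
    hi = (1 << bits) - 1
    total = a + b
    clamped = min(total, hi)
    return clamped, clamped != total


def add_signed_saturate(a, b, bits):
    """Clamp the sum to [-2^(n-1), 2^(n-1) - 1]; report whether it clamped."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    total = a + b
    clamped = max(lo, min(total, hi))
    return clamped, clamped != total
```

for byte elements, 200 + 100 gives 44 in modulo mode and 255 in unsigned saturating mode.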
borderline cases that generate results equal to saturation values, for example unsigned 0 + 0 = 0 and unsigned byte 1 + 254 = 255, are not considered saturation conditions and do not cause vscr[sat] to be set.

vscr[sat] can be set by the following types of integer, floating-point, and formatting instructions:

- move to vscr (mtvscr)
- vector add integer with saturation (vaddubs, vadduhs, vadduws, vaddsbs, vaddshs, vaddsws)
- vector subtract integer with saturation (vsububs, vsubuhs, vsubuws, vsubsbs, vsubshs, vsubsws)
- vector multiply-add integer with saturation (vmhaddshs, vmhraddshs)
- vector multiply-sum with saturation (vmsumuhs, vmsumshs)
- vector sum-across with saturation (vsumsws, vsum2sws, vsum4sbs, vsum4shs, vsum4ubs)
- vector pack with saturation (vpkuhus, vpkuwus, vpkshus, vpkswus, vpkshss, vpkswss)
- vector convert to fixed-point with saturation (vctuxs, vctsxs)

note that only instructions that explicitly call for saturation can set vscr[sat]. modulo integer instructions and floating-point arithmetic instructions never set vscr[sat]. for further details see section 2.3.2, "vector status and control register (vscr)."

4.2.1.2 vector integer arithmetic instructions

table 4-1 lists the integer arithmetic instructions for processors that implement the powerpc architecture.
table 4-1. vector integer arithmetic instructions

in the entries below, n is the element length in bits: 8 for b (byte), 16 for h (half word), and 32 for w (word).

name: vector add unsigned integer [b,h,w] modulo
mnemonic: vaddubm, vadduhm, vadduwm
syntax: vd,va,vb
operation: place the sum (va[unsigned integer elements]) + (vb[unsigned integer elements]) into vd[unsigned integer elements] using modulo arithmetic.
for b (byte, 8 bits = 1 byte), add sixteen unsigned integers from va to the corresponding sixteen unsigned integers from vb.
for h (half word, 16 bits = 2 bytes), add eight unsigned integers from va to the corresponding eight unsigned integers from vb.
for w (word, 32 bits = 4 bytes), add four unsigned integers from va to the corresponding four unsigned integers from vb.
note: unsigned or signed integers can be used with these instructions.

name: vector add unsigned integer [b,h,w] saturate
mnemonic: vaddubs, vadduhs, vadduws
syntax: vd,va,vb
operation: place the sum (va[unsigned integer elements]) + (vb[unsigned integer elements]) into vd[unsigned integer elements] using saturate clamping mode. saturate clamping mode means that if the resulting sum is > (2^n - 1), it saturates to (2^n - 1).
for b (byte, 8 bits = 1 byte), add sixteen unsigned integers from va to the corresponding sixteen unsigned integers from vb.
for h (half word, 16 bits = 2 bytes), add eight unsigned integers from va to the corresponding eight unsigned integers from vb.
for w (word, 32 bits = 4 bytes), add four unsigned integers from va to the corresponding four unsigned integers from vb.
if the result saturates, vscr[sat] is set.

name: vector add signed integer [b,h,w] saturate
mnemonic: vaddsbs, vaddshs, vaddsws
syntax: vd,va,vb
operation: place the sum (va[signed integer elements]) + (vb[signed integer elements]) into vd[signed integer elements] using saturate clamping mode. saturate clamping mode means that if the sum is > (2^(n-1) - 1), it saturates to (2^(n-1) - 1), and if it is < (-2^(n-1)), it saturates to (-2^(n-1)).
for b (byte, 8 bits = 1 byte), add sixteen signed integers from va to the corresponding sixteen signed integers from vb.
for h (half word, 16 bits = 2 bytes), add eight signed integers from va to the corresponding eight signed integers from vb.
for w (word, 32 bits = 4 bytes), add four signed integers from va to the corresponding four signed integers from vb.
if the result saturates, vscr[sat] is set.

name: vector add and write carry-out unsigned word
mnemonic: vaddcuw
syntax: vd,va,vb
operation: take the carry out of summing (va) + (vb) and place it into vd.
for w (word, 32 bits = 4 bytes), add four unsigned integers from va to the corresponding four unsigned integers from vb; the resulting carry-outs are placed in the corresponding word elements of vd.
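the relationship between the modulo word add and the carry-out word add can be sketched as below. this is a hypothetical python model, not altivec code: each vector is represented as a list of four unsigned 32-bit integers, and the function names simply reuse the mnemonics for readability.

```python
WORD_MASK = 0xFFFF_FFFF


def vadduwm(va, vb):
    """Low-order 32 bits of each word sum (modulo arithmetic)."""
    return [(a + b) & WORD_MASK for a, b in zip(va, vb)]


def vaddcuw(va, vb):
    """Carry out (0 or 1) of each 32-bit word sum."""
    return [(a + b) >> 32 for a, b in zip(va, vb)]
```

taken together, the two results describe the full 33-bit sum of each word pair, which is what makes the carry-out form useful for multi-word arithmetic.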
name: vector subtract unsigned integer modulo [b,h,w]
mnemonic: vsububm, vsubuhm, vsubuwm
syntax: vd,va,vb
operation: place the unsigned integer difference (va) - (vb) into vd using modulo arithmetic.
for b (byte, 8 bits = 1 byte), subtract sixteen unsigned integers in vb from the corresponding sixteen unsigned integers in va.
for h (half word, 16 bits = 2 bytes), subtract eight unsigned integers in vb from the corresponding eight unsigned integers in va.
for w (word, 32 bits = 4 bytes), subtract four unsigned integers in vb from the corresponding four unsigned integers in va.
note: unsigned or signed integers can be used with these instructions.

name: vector subtract unsigned integer saturate [b,h,w]
mnemonic: vsububs, vsubuhs, vsubuws
syntax: vd,va,vb
operation: place the unsigned integer difference (va) - (vb) into vd using saturate clamping mode; that is, if the difference is < 0, it saturates to 0 for the corresponding b, h, or w element.
for b (byte, 8 bits = 1 byte), subtract sixteen unsigned integers in vb from the corresponding sixteen unsigned integers in va.
for h (half word, 16 bits = 2 bytes), subtract eight unsigned integers in vb from the corresponding eight unsigned integers in va.
for w (word, 32 bits = 4 bytes), subtract four unsigned integers in vb from the corresponding four unsigned integers in va.
if the result saturates, vscr[sat] is set.

name: vector subtract signed integer saturate [b,h,w]
mnemonic: vsubsbs, vsubshs, vsubsws
syntax: vd,va,vb
operation: place the signed integer difference (va) - (vb) into vd using saturate clamping mode. saturate clamping mode means that if the difference is > (2^(n-1) - 1), it saturates to (2^(n-1) - 1), and if it is < (-2^(n-1)), it saturates to (-2^(n-1)), where n is the element length in bits (8, 16, or 32 for b, h, w).
for b (byte, 8 bits = 1 byte), subtract sixteen signed integers in vb from the corresponding sixteen signed integers in va.
for h (half word, 16 bits = 2 bytes), subtract eight signed integers in vb from the corresponding eight signed integers in va.
for w (word, 32 bits = 4 bytes), subtract four signed integers in vb from the corresponding four signed integers in va.
if the result saturates, vscr[sat] is set.

name: vector subtract and write carry-out unsigned word
mnemonic: vsubcuw
syntax: vd,va,vb
operation: take the carry out of the subtraction (va) - (vb) and place it into vd.
for w (word, 32 bits = 4 bytes), subtract four unsigned integers in vb from the corresponding four unsigned integers in va and place the resulting carry-outs into the corresponding word elements of vd.
name: vector multiply odd unsigned integer [b,h] modulo
mnemonic: vmuloub, vmulouh
syntax: vd,va,vb
operation: place the unsigned integer products of (va) * (vb) into vd using modulo arithmetic mode.
for b (byte, 8 bits = 1 byte), multiply the 8 odd-numbered unsigned integer byte elements from va by the corresponding 8 odd-numbered unsigned integer byte elements from vb, resulting in eight unsigned integer half-word products in vd.
for h (half word, 16 bits = 2 bytes), multiply the 4 odd-numbered unsigned integer half-word elements from va by the corresponding 4 odd-numbered unsigned integer half-word elements from vb, resulting in four unsigned integer word products in vd.

name: vector multiply odd signed integer [b,h] modulo
mnemonic: vmulosb, vmulosh
syntax: vd,va,vb
operation: place the signed integer products of (va) * (vb) into vd using modulo arithmetic mode.
for b (byte, 8 bits = 1 byte), multiply the 8 odd-numbered signed integer byte elements from va by the 8 odd-numbered signed integer byte elements from vb, resulting in eight signed integer half-word products in vd.
for h (half word, 16 bits = 2 bytes), multiply the 4 odd-numbered signed integer half-word elements from va by the 4 odd-numbered signed integer half-word elements from vb, resulting in four signed integer word products in vd.

name: vector multiply even unsigned integer [b,h] modulo
mnemonic: vmuleub, vmuleuh
syntax: vd,va,vb
operation: place the unsigned integer products of (va) * (vb) into vd using modulo arithmetic mode.
for b (byte, 8 bits = 1 byte), multiply the 8 even-numbered unsigned integer byte elements from va by the 8 even-numbered unsigned integer byte elements from vb, resulting in eight unsigned integer half-word products in vd.
for h (half word, 16 bits = 2 bytes), multiply the 4 even-numbered unsigned integer half-word elements from va by the 4 even-numbered unsigned integer half-word elements from vb, resulting in four unsigned integer word products in vd.

name: vector multiply even signed integer [b,h] modulo
mnemonic: vmulesb, vmulesh
syntax: vd,va,vb
operation: place the signed integer products of (va) * (vb) into vd using modulo arithmetic mode.
for b (byte, 8 bits = 1 byte), multiply the 8 even-numbered signed integer byte elements from va by the 8 even-numbered signed integer byte elements from vb, resulting in eight signed integer half-word products in vd.
for h (half word, 16 bits = 2 bytes), multiply the 4 even-numbered signed integer half-word elements from va by the 4 even-numbered signed integer half-word elements from vb, resulting in four signed integer word products in vd.
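the even/odd split in the multiply entries above can be pictured with a small python model of the unsigned byte forms. this is only a sketch under stated assumptions: a vector is a list of sixteen unsigned bytes indexed by element number (element 0 first), and the function names reuse the mnemonics purely for readability.

```python
def vmuleub(va, vb):
    """Products of the even-numbered byte elements (0, 2, ..., 14)."""
    return [va[i] * vb[i] for i in range(0, 16, 2)]


def vmuloub(va, vb):
    """Products of the odd-numbered byte elements (1, 3, ..., 15)."""
    return [va[i] * vb[i] for i in range(1, 16, 2)]
```

because each product of two 8-bit values always fits in a 16-bit half word (255 * 255 = 65025), the eight results fill the destination register without any loss of precision.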
name: vector multiply-high and add signed half-word saturate
mnemonic: vmhaddshs
syntax: vd,va,vb,vc
operation: add the 17 most significant bits (msbs) of the product of (va) * (vb) to sign-extended vc and place the result into vd.
for h (half word, 16 bits = 2 bytes), multiply the eight signed half words from va by the corresponding eight signed half words from vb to produce 32-bit intermediate products, then take the 17 msbs (bits 0-16) of the 8 intermediate products and add them to the 8 sign-extended half words in vc. place the 8 half-word saturated results in vd.
if the intermediate result is > (2^15 - 1), it saturates to (2^15 - 1), and if it is < -2^15, it saturates to -2^15. if the result saturates, vscr[sat] is set.

name: vector multiply-high round and add signed half-word saturate
mnemonic: vmhraddshs
syntax: vd,va,vb,vc
operation: add the rounded product of (va) * (vb) to sign-extended vc and place the result into vd.
for h (half word, 16 bits = 2 bytes), multiply the eight signed integers from va by the corresponding eight signed integers from vb and then round each of the 8 intermediate products by adding the value 0x0000_4000 to it. then add the most significant bits (msbs), bits 0-16, of the 8 rounded intermediate products to the 8 sign-extended values in vc and place the eight signed half-word saturated results into vd.
if the intermediate result is > (2^15 - 1), it saturates to (2^15 - 1), and if it is < -2^15, it saturates to -2^15. if the result saturates, vscr[sat] is set.

name: vector multiply-low and add unsigned half-word modulo
mnemonic: vmladduhm
syntax: vd,va,vb,vc
operation: add the product of (va) * (vb) to zero-extended vc and place the result into vd.
for h (half word, 16 bits = 2 bytes), multiply the eight integers from va by the corresponding eight integers from vb to produce 32-bit intermediate products. the 16-bit value in vc is zero-extended to 32 bits and added to the intermediate product, and the lower 16 bits of the sum (bits 16-31) are placed in vd.
note: unsigned or signed integers can be used with this instruction.

name: vector multiply-sum unsigned integer [b,h] modulo
mnemonic: vmsumubm, vmsumuhm
syntax: vd,va,vb,vc
operation: the sum of the products of (va) * (vb) is added to zero-extended vc and placed into vd using modulo arithmetic.
for b (byte, 8 bits = 1 byte), multiply the four unsigned integer bytes in a word element of va by the corresponding four unsigned integer bytes in the word element of vb. add the sum of these products to the zero-extended unsigned integer word element in vc and place the unsigned integer word result into vd. this process is followed for each of the four word elements in va and vb.
for h (half word, 16 bits = 2 bytes), multiply the 2 unsigned integer half words in a word element of va by the corresponding 2 unsigned integer half words in the word element of vb. add the sum of these products to the zero-extended unsigned integer word element in vc and place the unsigned integer word result into vd. this process is followed for each of the four word elements in va and vb.
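the multiply-high and multiply-low entries above can be illustrated per element with the python sketch below. it is a model under stated assumptions, not a definitive implementation: the `_lane` function names are hypothetical, the half-word inputs are ordinary python integers in the signed 16-bit range (or the unsigned 16-bit range for the multiply-low form), and taking the 17 most significant bits of the 32-bit product is modeled as an arithmetic right shift by 15.

```python
def clamp_s16(x):
    """Saturate to the signed half-word range [-2^15, 2^15 - 1]."""
    return max(-32768, min(32767, x))


def vmhaddshs_lane(a, b, c):
    """17 msbs of a*b, plus c, saturated to a signed half word."""
    return clamp_s16(((a * b) >> 15) + c)


def vmhraddshs_lane(a, b, c):
    """As above, but 0x0000_4000 is added to the product before the shift."""
    return clamp_s16(((a * b + 0x4000) >> 15) + c)


def vmladduhm_lane(a, b, c):
    """Low-order 16 bits of a*b + c (modulo arithmetic)."""
    return (a * b + c) & 0xFFFF
```

the only input pair whose high product alone exceeds the half-word range is (-32768) * (-32768), which yields 32768 before clamping.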
name: vector multiply-sum signed half-word saturate
mnemonic: vmsumshs
syntax: vd,va,vb,vc
operation: add the sum of the products of (va) * (vb) to vc and place the result into vd using saturate clamping mode.
for h (half word, 16 bits = 2 bytes), multiply the 2 signed integer half words in a word element of va by the corresponding 2 signed integer half words in the word element of vb. add the sum of these products to the signed integer word element in vc and place the signed integer word result into vd. this process is followed for each of the four word elements in va and vb.
if the intermediate result is > (2^31 - 1), it saturates to (2^31 - 1), and if it is < -2^31, it saturates to -2^31. if the result saturates, vscr[sat] is set.

name: vector multiply-sum unsigned half-word saturate
mnemonic: vmsumuhs
syntax: vd,va,vb,vc
operation: add the sum of the products of (va) * (vb) to zero-extended vc and place the result into vd using saturate clamping mode.
for h (half word, 16 bits = 2 bytes), multiply the 2 unsigned integer half words in a word element of va by the corresponding 2 unsigned integer half words in the word element of vb. add the sum of these products to the zero-extended unsigned integer word element in vc and place the unsigned integer word result into vd. this process is followed for each of the four word elements in va and vb.
if the intermediate result is > (2^32 - 1), it saturates to (2^32 - 1). if the result saturates, vscr[sat] is set.

name: vector multiply-sum mixed sign byte modulo
mnemonic: vmsummbm
syntax: vd,va,vb,vc
operation: add the sum of the products of (va) * (vb) to vc and place the result into vd using modulo arithmetic.
for b (byte, 8 bits = 1 byte), multiply the four signed integer bytes in a word element of va by the corresponding four unsigned integer bytes in the word element of vb. add the sum of these four signed products to the signed integer word element in vc and place the signed integer word result into vd. this process is followed for each of the four word elements in va and vb.

name: vector multiply-sum signed half-word modulo
mnemonic: vmsumshm
syntax: vd,va,vb,vc
operation: add the sum of the products of (va) * (vb) to vc and place the result into vd using modulo arithmetic.
for h (half word, 16 bits = 2 bytes), multiply the 2 signed integer half words in a word element of va by the corresponding 2 signed integer half words in the word element of vb. add the sum of these 2 products to the signed integer word element in vc and place the signed integer word result into vd. this process is followed for each of the four word elements in va and vb.

name: vector sum across signed word saturate
mnemonic: vsumsws
syntax: vd,va,vb
operation: place the sum of the signed word elements in va and the word in vb[96-127] into vd.
for w (word, 32 bits = 4 bytes), add the sum of the four signed integer word elements in va to the word element in vb[96-127]. if the intermediate result is > (2^31 - 1), it saturates to (2^31 - 1), and if it is < -2^31, it saturates to -2^31. place the signed integer result in vd[96-127]; vd[0-95] are cleared.
name: vector sum across partial (1/2) signed word saturate
mnemonic: vsum2sws
syntax: vd,va,vb
operation: add va[word 0 + word 1] + vb[word 1] and place the result in vd[word 1]. likewise, add va[word 2 + word 3] + vb[word 3] and place the result in vd[word 3].
word 0 = bits 0-31, word 1 = bits 32-63, word 2 = bits 64-95, word 3 = bits 96-127. figure 1-2 shows how the word elements are laid out in a vector register.
add the sum of word 0 and word 1 of va to word 1 of vb using saturate clamping mode and place the result into word 1 of vd. then add the sum of word 2 and word 3 of va to word 3 of vb using saturate clamping mode and place the result into word 3 of vd.
if the intermediate result for either calculation is > (2^31 - 1), it saturates to (2^31 - 1), and if it is < -2^31, it saturates to -2^31. if the result saturates, vscr[sat] is set.

name: vector sum across partial (1/4) unsigned byte saturate
mnemonic: vsum4ubs
syntax: vd,va,vb
operation: add va[4 byte elements summed to a word] and vb[word element], then place the result in vd[word element] using saturate clamping mode.
for b (byte, 8 bits = 1 byte), for each word element in vb, add the sum of the four unsigned bytes in the corresponding word of va to the unsigned word element in vb and place the result into the corresponding unsigned word element in vd.
if the intermediate result is > (2^32 - 1), it saturates to (2^32 - 1). if the result saturates, vscr[sat] is set.

name: vector sum across partial (1/4) signed integer saturate
mnemonic: vsum4sbs, vsum4shs
syntax: vd,va,vb
operation: add va[sum of signed integer elements in word] and vb[word element], then place the result in vd[word element] using saturate clamping mode.
for b (byte, 8 bits = 1 byte), for each word element in vb, add the sum of the four signed bytes in the corresponding word of va to the signed word element in vb and place the result into the corresponding signed word element in vd.
for h (half word, 16 bits = 2 bytes), for each word element in vb, add the sum of the 2 signed half words in the corresponding word of va to the signed word element in vb and place the result into the corresponding signed word element in vd.
in either case, if the intermediate result is > (2^31 - 1), it saturates to (2^31 - 1), and if it is < -2^31, it saturates to -2^31. if the result saturates, vscr[sat] is set.
name: vector average unsigned integer [b,h,w]
mnemonic: vavgub, vavguh, vavguw
syntax: vd,va,vb
operation: compute the sum (va[unsigned integer elements] + vb[unsigned integer elements]) + 1 for each element pair and place the high-order bits of the result into vd.
for b (byte, 8 bits = 1 byte), add sixteen unsigned integers from va to sixteen unsigned integers from vb, then add 1 to the sums and place the high-order result in vd.
for h (half word, 16 bits = 2 bytes), add eight unsigned integers from va to eight unsigned integers from vb, then add 1 to the sums and place the high-order result in vd.
for w (word, 32 bits = 4 bytes), add four unsigned integers from va to four unsigned integers from vb, then add 1 to the sums and place the high-order result in vd.

name: vector average signed integer [b,h,w]
mnemonic: vavgsb, vavgsh, vavgsw
syntax: vd,va,vb
operation: compute the sum (va[signed integer elements] + vb[signed integer elements]) + 1 for each element pair and place the high-order bits of the result into vd.
for b (byte, 8 bits = 1 byte), add sixteen signed integers from va to sixteen signed integers from vb, then add 1 to the sums and place the high-order result in vd.
for h (half word, 16 bits = 2 bytes), add eight signed integers from va to eight signed integers from vb, then add 1 to the sums and place the high-order result in vd.
for w (word, 32 bits = 4 bytes), add four signed integers from va to four signed integers from vb, then add 1 to the sums and place the high-order result in vd.
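one way to read "place the high-order result" in the average entries is that the sum plus 1 is formed with one extra bit of precision and the low-order bit is then dropped. the python sketch below models that reading for a list of elements; the function name is hypothetical, and the same expression is assumed to cover both the unsigned and the signed forms because python integers do not overflow and `>>` is an arithmetic shift.

```python
def vector_average(va, vb):
    """Rounded average of each element pair: (a + b + 1) shifted right by 1."""
    return [(a + b + 1) >> 1 for a, b in zip(va, vb)]
```

under this model the result always fits in the element width, since the average of two n-bit values cannot exceed the larger of the two.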
name: vector maximum unsigned integer [b,h,w]
mnemonic: vmaxub, vmaxuh, vmaxuw
syntax: vd,va,vb
operation: compare each unsigned integer element in va with the corresponding unsigned integer element in vb and place the larger value into the corresponding element of vd.
for b (byte, 8 bits = 1 byte), compare sixteen unsigned integers from va with sixteen unsigned integers from vb.
for h (half word, 16 bits = 2 bytes), compare eight unsigned integers from va with eight unsigned integers from vb.
for w (word, 32 bits = 4 bytes), compare four unsigned integers from va with four unsigned integers from vb.

name: vector maximum signed integer [b,h,w]
mnemonic: vmaxsb, vmaxsh, vmaxsw
syntax: vd,va,vb
operation: compare each signed integer element in va with the corresponding signed integer element in vb and place the larger value into the corresponding element of vd.
for b (byte, 8 bits = 1 byte), compare sixteen signed integers from va with sixteen signed integers from vb.
for h (half word, 16 bits = 2 bytes), compare eight signed integers from va with eight signed integers from vb.
for w (word, 32 bits = 4 bytes), compare four signed integers from va with four signed integers from vb.
motorola chapter 4. addressing modes and instruction set summary 4-13 altivec uisa instructions 4.2.1.3 vector integer compare instructions the vector integer compare instructions algebraically or logically compare the contents of the elements in vector register v a with the contents of the elements in v b. each compare result vector is comprised of true (0xff, 0xffff, 0xffffffff) or false (0x00, 0x0000, 0x00000000) elements of the size speci?d by the compare source operand element (byte, half word, or word). the result vector can be directed to any vector register and can be manipulated with any of the instructions as normal data, for example, combining condition results. vector compares provide equal-to and greater-than predicates. others are synthesized from these by logically combining or inverting result vectors. if the record bit (rc) is set in the integer compare instructions (shown in table 4-3), it can optionally set the cr6 ?ld of the powerpc condition register. if rc = 1 in the vector integer compare instruction, then cr6 re?cts the result of the comparison, as shown in table 4-2. vector minimum unsigned integer [b,h,w] vminub vminuh vminuw v d ,v a ,v b compare the minimum of v a and v b unsigned integers for each integer value and which ever value is smaller, place that unsigned integer value into v d. for b , byte, integer length = 8 bits = 1 byte, compare sixteen unsigned integers from v a with sixteen unsigned integers from v b. for h , half word, integer length = 16 bits = 2 bytes, compare eight unsigned integers from v a with eight unsigned integers from v b. for w , word, integer length = 32 bits = 4 bytes, compare four unsigned integers from v a with four unsigned integers from v b. vector minimum signed integer [b,h,w] vminsb vminsh vminsw v d ,v a ,v b compare the minimum of v a and v b signed integers for each integer value and which ever value is smaller, place that signed integer value into v d. 
  For b (byte, integer length = 8 bits = 1 byte), compare sixteen signed integers from vA with sixteen signed integers from vB.
  For h (half word, integer length = 16 bits = 2 bytes), compare eight signed integers from vA with eight signed integers from vB.
  For w (word, integer length = 32 bits = 4 bytes), compare four signed integers from vA with four signed integers from vB.

Table 4-2. CR6 Field Bit Settings for Vector Integer Compare Instructions

CR Bit   CR6 Bit   Vector Compare
24       0         1 = Relation is true for all element pairs (that is, vD is set to all ones).
25       1         0
26       2         1 = Relation is false for all element pairs (that is, register vD is cleared).
27       3         0
Table 4-3 summarizes the vector integer compare instructions.

Table 4-3. Vector Integer Compare Instructions

Name: Vector Compare Greater Than Unsigned Integer [b,h,w]
Mnemonic: vcmpgtub[.], vcmpgtuh[.], vcmpgtuw[.]
Syntax: vD,vA,vB
Operation: Compare each element in vA with the corresponding element in vB, treating the operands as unsigned integers. Place the result of the comparison into the vector register specified by operand vD. For each element, if vA > vB, the element in vD is set to all 1s; otherwise it is cleared to all 0s.
  If the record bit (Rc) is set in the vector compare instruction:
    if vD == 1s (all elements true), CR6[0] is set;
    if vD == 0s (all elements false), CR6[2] is set.
  For b (byte, integer length = 8 bits = 1 byte), compare sixteen unsigned integers from vA to sixteen unsigned integers from vB and place the results in the corresponding 16 elements in vD.
  For h (half word, integer length = 16 bits = 2 bytes), compare eight unsigned integers from vA to eight unsigned integers from vB and place the results in the corresponding 8 elements in vD.
  For w (word, integer length = 32 bits = 4 bytes), compare four unsigned integers from vA to four unsigned integers from vB and place the results in the corresponding 4 elements in vD.
4.2.1.4 Vector Integer Logical Instructions

The vector integer logical instructions shown in Table 4-4 perform bit-parallel operations on the operands.

Table 4-3. Vector Integer Compare Instructions (continued)

Name: Vector Compare Greater Than Signed Integer [b,h,w]
Mnemonic: vcmpgtsb[.], vcmpgtsh[.], vcmpgtsw[.]
Syntax: vD,vA,vB
Operation: Compare each element in vA with the corresponding element in vB, treating the operands as signed integers. Place the result of the comparison into the vector register specified by operand vD. For each element, if vA > vB, the element in vD is set to all 1s; otherwise it is cleared to all 0s.
  If the record bit (Rc) is set in the vector compare instruction:
    if vD == 1s (all elements true), CR6[0] is set;
    if vD == 0s (all elements false), CR6[2] is set.
  For b (byte, integer length = 8 bits = 1 byte), compare sixteen signed integers from vA to sixteen signed integers from vB and place the results in the 16 corresponding elements in vD.
  For h (half word, integer length = 16 bits = 2 bytes), compare eight signed integers from vA to eight signed integers from vB and place the results in the 8 corresponding elements in vD.
  For w (word, integer length = 32 bits = 4 bytes), compare four signed integers from vA to four signed integers from vB and place the results in the 4 corresponding elements in vD.

Name: Vector Compare Equal To Unsigned Integer [b,h,w]
Mnemonic: vcmpequb[.], vcmpequh[.], vcmpequw[.]
Syntax: vD,vA,vB
Operation: Compare each element in vA with the corresponding element in vB, treating the operands as unsigned integers. Place the result of the comparison into the vector register specified by operand vD. For each element, if vA = vB, the element in vD is set to all 1s; otherwise it is cleared to all 0s.
  If the record bit (Rc) is set in the vector compare instruction:
    if vD == 1s (all elements true), CR6[0] is set;
    if vD == 0s (all elements false), CR6[2] is set.
  For b (byte, integer length = 8 bits = 1 byte), compare sixteen unsigned integers from vA to sixteen unsigned integers from vB and place the results in the corresponding 16 elements in vD.
  For h (half word, integer length = 16 bits = 2 bytes), compare eight unsigned integers from vA to eight unsigned integers from vB and place the results in the corresponding 8 elements in vD.
  For w (word, integer length = 32 bits = 4 bytes), compare four unsigned integers from vA to four unsigned integers from vB and place the results in the corresponding 4 elements in vD.
  Note: vcmpequb[.], vcmpequh[.], and vcmpequw[.] can be used with both unsigned and signed integers.
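The compare semantics in Tables 4-2 and 4-3 can be sketched with a small scalar model. This is only an illustration in Python, not AltiVec code: the function names are hypothetical, vectors are modeled as plain lists of unsigned integers, and the CR6 helper returns only the two bits that the tables describe.

```python
def vcmpgtu(va, vb, bits=8):
    """Model of vcmpgtub/vcmpgtuh/vcmpgtuw: element-wise unsigned vA > vB."""
    true_mask = (1 << bits) - 1  # 0xFF, 0xFFFF, or 0xFFFFFFFF
    return [true_mask if a > b else 0 for a, b in zip(va, vb)]


def cr6_bits(vd, bits=8):
    """Return (CR6[0], CR6[2]) as the record form (Rc = 1) would set them."""
    true_mask = (1 << bits) - 1
    all_true = all(e == true_mask for e in vd)
    all_false = all(e == 0 for e in vd)
    return int(all_true), int(all_false)


def vnor(va, vb, bits=8):
    """Model of vnor; vnor(x, x) inverts a result vector."""
    mask = (1 << bits) - 1
    return [~(a | b) & mask for a, b in zip(va, vb)]


va = [5, 200, 7, 0] * 4   # sixteen byte elements
vb = [3, 100, 7, 1] * 4
gt = vcmpgtu(va, vb)      # greater-than mask
le = vnor(gt, gt)         # "less than or equal" synthesized by inversion
```

Here `le` shows how a predicate that has no instruction of its own is synthesized by inverting a result vector, as the text describes.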
4.2.1.5 Vector Integer Rotate and Shift Instructions

The vector integer rotate instructions are summarized in Table 4-5. The vector integer shift instructions are summarized in Table 4-6.

Table 4-4. Vector Integer Logical Instructions

Name: Vector Logical AND
Mnemonic: vand
Syntax: vD,vA,vB
Operation: AND the contents of vA with vB and place the result into vD.

Name: Vector Logical OR
Mnemonic: vor
Syntax: vD,vA,vB
Operation: OR the contents of vA with vB and place the result into vD.

Name: Vector Logical XOR
Mnemonic: vxor
Syntax: vD,vA,vB
Operation: XOR the contents of vA with vB and place the result into vD.

Name: Vector Logical AND with Complement
Mnemonic: vandc
Syntax: vD,vA,vB
Operation: AND the contents of vA with the complement of vB and place the result into vD.

Name: Vector Logical NOR
Mnemonic: vnor
Syntax: vD,vA,vB
Operation: NOR the contents of vA with vB and place the result into vD.

Table 4-5. Vector Integer Rotate Instructions

Name: Vector Rotate Left Integer [b,h,w]
Mnemonic: vrlb, vrlh, vrlw
Syntax: vD,vA,vB
Operation: Rotate each element in vA left by the number of bits specified in the low-order log2(n) bits of the corresponding element in vB, where n is the element length in bits. Place the result into the corresponding element of vD.
  For b (byte, integer length = 8 bits = 1 byte), use 16 integers from vA with 16 integers from vB.
  For h (half word, integer length = 16 bits = 2 bytes), use 8 integers from vA with 8 integers from vB.
  For w (word, integer length = 32 bits = 4 bytes), use 4 integers from vA with 4 integers from vB.
4.2.2 Vector Floating-Point Instructions

This section describes the vector floating-point instructions, which include the following:
  - Arithmetic
  - Rounding and conversion
  - Compare
  - Estimate

The AltiVec floating-point data format complies with the ANSI/IEEE-754 standard. A quantity in this format represents a signed normalized number, a signed denormalized number, a signed zero, a signed infinity, a quiet not a number (QNaN), or a signalling NaN (SNaN). Operations conform to a Java/IEEE/C9X-compliant subset of the IEEE standard.

Table 4-6. Vector Integer Shift Instructions

Name: Vector Shift Left Integer [b,h,w]
Mnemonic: vslb, vslh, vslw
Syntax: vD,vA,vB
Operation: Shift each element in vA left by the number of bits specified in the low-order log2(n) bits of the corresponding element in vB. Bits shifted out of bit 0 of the element are lost. Supply zeros to the vacated bits on the right. Place the result into the corresponding element of vD.
  For b (byte, integer length = 8 bits = 1 byte), use 16 integers from vA with 16 integers from vB.
  For h (half word, integer length = 16 bits = 2 bytes), use 8 integers from vA with 8 integers from vB.
  For w (word, integer length = 32 bits = 4 bytes), use 4 integers from vA with 4 integers from vB.

Name: Vector Shift Right Integer [b,h,w]
Mnemonic: vsrb, vsrh, vsrw
Syntax: vD,vA,vB
Operation: Shift each element in vA right by the number of bits specified in the low-order log2(n) bits of the corresponding element in vB. Bits shifted out of bit n - 1 of the element are lost. Supply zeros to the vacated bits on the left. Place the result into the corresponding element of vD.
  For b (byte, integer length = 8 bits = 1 byte), use 16 integers from vA with 16 integers from vB.
  For h (half word, integer length = 16 bits = 2 bytes), use 8 integers from vA with 8 integers from vB.
  For w (word, integer length = 32 bits = 4 bytes), use 4 integers from vA with 4 integers from vB.

Name: Vector Shift Right Algebraic Integer [b,h,w]
Mnemonic: vsrab, vsrah, vsraw
Syntax: vD,vA,vB
Operation: Shift each element in vA right by the number of bits specified in the low-order log2(n) bits of the corresponding element in vB. Bits shifted out of bit n - 1 of the element are lost. Replicate bit 0 of the element to fill the vacated bits on the left. Place the result into the corresponding element of vD.
  For b (byte, integer length = 8 bits = 1 byte), use 16 integers from vA with 16 integers from vB.
  For h (half word, integer length = 16 bits = 2 bytes), use 8 integers from vA with 8 integers from vB.
  For w (word, integer length = 32 bits = 4 bytes), use 4 integers from vA with 4 integers from vB.
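The per-element behavior in Tables 4-5 and 4-6 can be illustrated with a scalar sketch. It is a Python model with hypothetical function names, operating on one n-bit element at a time (n = 8, 16, or 32). Note that the manual numbers bits from the most significant end, so "bit 0" is the sign bit and "bit n - 1" is the least significant bit.

```python
def shift_count(b, n):
    """Only the low-order log2(n) bits of the vB element are used."""
    return b & (n - 1)


def vrl(a, b, n):
    """Rotate the n-bit element a left (vrlb/vrlh/vrlw)."""
    sh = shift_count(b, n)
    mask = (1 << n) - 1
    return ((a << sh) | (a >> (n - sh))) & mask


def vsl(a, b, n):
    """Shift left, zeros enter on the right (vslb/vslh/vslw)."""
    return (a << shift_count(b, n)) & ((1 << n) - 1)


def vsr(a, b, n):
    """Logical shift right, zeros enter on the left (vsrb/vsrh/vsrw)."""
    return a >> shift_count(b, n)


def vsra(a, b, n):
    """Algebraic shift right, the sign bit is replicated (vsrab/vsrah/vsraw)."""
    if a & (1 << (n - 1)):      # reinterpret the element as a signed value
        a -= 1 << n
    return (a >> shift_count(b, n)) & ((1 << n) - 1)
```

For a byte element, a count of 9 behaves like a count of 1 because only the low-order three bits of the vB element are examined.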
For further details on the Java or non-Java mode, see Section 3.2.1, "Floating-Point Modes." The AltiVec ISA does not report IEEE exceptions but rather produces default results as specified by the Java/IEEE/C9X standard. For further details on exceptions, see Section 3.2.4, "Floating-Point Exceptions."

4.2.2.1 Floating-Point Division and Square-Root

The AltiVec ISA does not have division or square-root instructions. Instead, it implements the vector reciprocal estimate floating-point (vrefp) and vector reciprocal-square-root estimate floating-point (vrsqrtefp) instructions, along with a vector negative multiply-subtract floating-point (vnmsubfp) instruction that assists in the Newton-Raphson refinement of the estimates. To divide, multiply the dividend by the reciprocal of the divisor (x/y = x * 1/y); to take a square root, multiply the original number by its reciprocal square root (sqrt(x) = x * 1/sqrt(x)). In this way, the AltiVec ISA provides inexpensive divides and square-roots that are fully pipelined, sub-operation scheduled, and faster even than many hardware dividers. Software methods are available to further refine these to correct IEEE results.

4.2.2.1.1 Floating-Point Division

The Newton-Raphson refinement step for the reciprocal 1/b looks like this:

    y1 = y0 + y0*(1 - b*y0), where y0 = recip_est(b)

This is implemented in the AltiVec ISA as follows:

    y0 = vrefp(b)
    t  = vnmsubfp(y0,b,1)
    y1 = vmaddfp(y0,t,y0)

This produces a result accurate to almost 24 bits of precision, except where b is a sufficiently small denormalized number that vrefp generates an infinity; if that case is important, it must be explicitly guarded against.

To get a correctly rounded IEEE quotient from the above result, a second Newton-Raphson iteration is performed to get a correctly rounded reciprocal (y2) to the required 24 bits of precision, then the residual
    r = a - b*q

is computed with vnmsubfp (where a is the dividend, b the divisor, and q an approximation of the quotient from a*y2). The correctly rounded quotient can then be obtained:

    q' = q + r*y2

The additional accuracy provided by the fused nature of the AltiVec multiply-add instruction is essential to producing the correctly rounded quotient by this method.
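The division sequence above can be tried out with a scalar sketch. Everything here is an assumption made for illustration: the functions are Python models rather than AltiVec instructions, `vrefp_standin` is a hypothetical stand-in that returns the true reciprocal with a deliberate relative error of about 2^-13 (the real estimate differs), and the fused operations are approximated by computing in double precision and rounding once to single precision.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def vmaddfp(a, c, b):
    """Model of vmaddfp: a*c + b, rounded once to single precision."""
    return f32(a * c + b)


def vnmsubfp(a, c, b):
    """Model of vnmsubfp: -(a*c - b) = b - a*c, rounded once."""
    return f32(b - a * c)


def vrefp_standin(b):
    """Hypothetical reciprocal estimate with about 2**-13 relative error."""
    return f32((1.0 / b) * (1.0 + 2.0 ** -13))


def refine_reciprocal(b, y):
    """One Newton-Raphson step: y + y*(1 - b*y)."""
    t = vnmsubfp(y, b, 1.0)
    return vmaddfp(y, t, y)


def divide(a, b):
    """Quotient a/b by two refinement steps plus the residual correction."""
    y0 = vrefp_standin(b)
    y1 = refine_reciprocal(b, y0)
    y2 = refine_reciprocal(b, y1)
    q = vmaddfp(a, y2, 0.0)     # q = a*y2
    r = vnmsubfp(b, q, a)       # r = a - b*q
    return vmaddfp(r, y2, q)    # q' = q + r*y2
```

For example, `divide(10.0, 3.0)` follows the y0, y1, y2, q, r, q' steps named in the text; the sketch does not guard against the denormalized-divisor case mentioned above.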
The second Newton-Raphson iteration may ultimately not be needed, but more work must be done to show that the absolute error after the first refinement step would always be less than 1 ulp, which is a requirement of this method.

4.2.2.1.2 Floating-Point Square-Root

The Newton-Raphson refinement step for the reciprocal square root looks like the following:

    y1 = y0 + 0.5*y0*(1 - b*y0*y0), where y0 = recip_sqrt_est(b)

That can be implemented as follows:

    y0 = vrsqrtefp(b)
    t0 = vmaddfp(y0,y0,0.0)
    t1 = vmaddfp(y0,0.5,0.0)
    t0 = vnmsubfp(b,t0,1)
    y1 = vmaddfp(t0,t1,y0)

Various methods can further refine this to a correctly rounded IEEE result. All are more elaborate than the simple residual correction for division and therefore are not presented here, but most of them also benefit from the negative multiply-subtract instruction.

4.2.2.2 Floating-Point Arithmetic Instructions

The floating-point arithmetic instructions are summarized in Table 4-7.

Table 4-7. Floating-Point Arithmetic Instructions

Name: Vector Add Floating-Point
Mnemonic: vaddfp
Syntax: vD,vA,vB
Operation: Add the four 32-bit (word) floating-point elements in vA to the four 32-bit floating-point elements in vB. Round the four intermediate results to the nearest single-precision number and place them into vD.

Name: Vector Subtract Floating-Point
Mnemonic: vsubfp
Syntax: vD,vA,vB
Operation: Subtract the four 32-bit (word) floating-point elements in vB from the four 32-bit floating-point elements in vA. Round the four intermediate results to the nearest single-precision number and place them into vD.
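The reciprocal square-root refinement of Section 4.2.2.1.2 can be sketched in the same scalar style. As before, this is a Python model under stated assumptions: `vrsqrtefp_standin` is a hypothetical estimate with a deliberate relative error of about 2^-13, and the fused operations are approximated by a double-precision computation rounded once to single precision.

```python
import math
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def vmaddfp(a, c, b):
    """Model of vmaddfp: a*c + b, rounded once to single precision."""
    return f32(a * c + b)


def vnmsubfp(a, c, b):
    """Model of vnmsubfp: -(a*c - b) = b - a*c, rounded once."""
    return f32(b - a * c)


def vrsqrtefp_standin(b):
    """Hypothetical 1/sqrt(b) estimate with about 2**-13 relative error."""
    return f32((1.0 / math.sqrt(b)) * (1.0 + 2.0 ** -13))


def refine_rsqrt(b, y0):
    """One step: y1 = y0 + 0.5*y0*(1 - b*y0*y0), in the order given above."""
    t0 = vmaddfp(y0, y0, 0.0)    # y0*y0
    t1 = vmaddfp(y0, 0.5, 0.0)   # 0.5*y0
    t0 = vnmsubfp(b, t0, 1.0)    # 1 - b*y0*y0
    return vmaddfp(t0, t1, y0)   # y0 + 0.5*y0*(1 - b*y0*y0)


def sqrt_estimate(b):
    """sqrt(b) = b * 1/sqrt(b), using one refinement step."""
    return f32(b * refine_rsqrt(b, vrsqrtefp_standin(b)))
```

One step takes the stand-in estimate from roughly 13 good bits to close to single precision; the further refinement to a correctly rounded result is, as the text notes, more elaborate and is not attempted here.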
4.2.2.3 Floating-Point Multiply-Add Instructions

Vector multiply-add instructions are critically important to performance because a multiply followed by a data-dependent addition is the most common idiom in DSP algorithms. In most implementations, floating-point multiply-add instructions perform with the same latency as either a multiply or an add alone, thus doubling performance compared to separate, serially dependent multiply and add instructions.

AltiVec floating-point multiply-add instructions are fused (that is, the full product participates in the add operation without rounding; only the final result is rounded). This not only simplifies the implementation and reduces latency (by eliminating the intermediate rounding) but also increases the accuracy compared to separate multiply and add instructions.

Note that Java-compliant programs cannot use fused multiply-add instructions directly, because Java requires the product and the sum to be rounded separately. Thus, to achieve strict Java compliance, perform the multiply and the add with separate instructions.

To realize a multiply in the AltiVec ISA, use a multiply-add instruction with a zero addend (for example, vmaddfp vD,vA,vC,vB where vB = 0.0).

Note that to use multiply-add instructions to perform an IEEE- or Java-compliant multiply, the addend must be -0.0. This is necessary to ensure that the sign of a zero result is correct when the product is either +0.0 or -0.0 (+0.0 + -0.0 → +0.0, and -0.0 + -0.0 → -0.0).
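The effect of the addend's sign on a zero product can be checked with ordinary IEEE arithmetic. The sketch below is a Python scalar model (the `vmaddfp` function is a hypothetical stand-in for one element of the instruction); it relies only on the round-to-nearest rule that a sum of zeros with opposite signs is +0.0.

```python
import math
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def vmaddfp(a, c, b):
    """Model of one vmaddfp element: a*c + b, rounded once."""
    return f32(a * c + b)


def sign_of(x):
    """Return -1.0 or +1.0, distinguishing -0.0 from +0.0."""
    return math.copysign(1.0, x)


# The product -1.0 * 0.0 is -0.0.
with_neg_zero = vmaddfp(-1.0, 0.0, -0.0)   # -0.0 + -0.0 gives -0.0
with_pos_zero = vmaddfp(-1.0, 0.0, 0.0)    # -0.0 + +0.0 gives +0.0
```

With a -0.0 addend the sign of the zero product survives; with a +0.0 addend a negative zero product is turned into +0.0, which is why the text calls for -0.0 when the sign matters.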
Table 4-7. Floating-Point Arithmetic Instructions (continued)

Name: Vector Maximum Floating-Point
Mnemonic: vmaxfp
Syntax: vD,vA,vB
Operation: Compare each of the four single-precision word elements in vA to the corresponding four single-precision word elements in vB and place the larger value within each pair into the corresponding word element in vD. vmaxfp is sensitive to the sign of 0.0. When both operands are ±0.0:
    max(+0.0, ±0.0) = max(±0.0, +0.0) → +0.0
    max(-0.0, -0.0) → -0.0
    max(NaN, x) → QNaN, where x = any value

Name: Vector Minimum Floating-Point
Mnemonic: vminfp
Syntax: vD,vA,vB
Operation: Compare each of the four single-precision word elements in vA to the corresponding four single-precision word elements in vB and place the smaller value within each pair into the corresponding word element in vD. vminfp is sensitive to the sign of 0.0. When both operands are ±0.0:
    min(-0.0, ±0.0) = min(±0.0, -0.0) → -0.0
    min(+0.0, +0.0) → +0.0
    min(NaN, x) → QNaN, where x = any value

When the sign of a resulting 0.0 is not important, then use +0.0 as the addend that may, in some
cases, avoid the need for a second register to hold a -0.0 in addition to the integer 0/floating-point +0.0 that may already be available.

The floating-point multiply-add instructions are summarized in Table 4-8.

Table 4-8. Floating-Point Multiply-Add Instructions

Name: Vector Multiply-Add Floating-Point
Mnemonic: vmaddfp
Syntax: vD,vA,vC,vB
Operation: Multiply the four word floating-point elements in vA by the corresponding four word elements in vC. Add the four word elements in vB to the four intermediate products. Round the results to the nearest single-precision numbers and place them into the corresponding word elements of vD.

Name: Vector Negative Multiply-Subtract Floating-Point
Mnemonic: vnmsubfp
Syntax: vD,vA,vC,vB
Operation: Multiply the four word floating-point elements in vA by the corresponding four word elements in vC. Subtract the four word floating-point elements in vB from the four intermediate products and invert the sign of the difference. Round the results to the nearest single-precision numbers and place them into the corresponding word elements of vD.

4.2.2.4 Floating-Point Rounding and Conversion Instructions

All AltiVec floating-point arithmetic instructions use the IEEE default rounding mode, round-to-nearest. The AltiVec ISA does not provide the IEEE directed rounding modes.

The AltiVec ISA provides separate instructions for converting floating-point numbers to integral floating-point values for all IEEE rounding modes, as follows:
  - Round-to-nearest (vrfin) (round)
  - Round-toward-zero (vrfiz) (truncate)
  - Round-toward-minus-infinity (vrfim) (floor)
  - Round-toward-positive-infinity (vrfip) (ceiling)

Floating-point conversions to integers (vctuxs, vctsxs) use round-toward-zero (truncate). The floating-point rounding instructions are described in Table 4-9.

Table 4-9. Floating-Point Rounding and Conversion Instructions

Name: Vector Round to Floating-Point Integer Nearest
Mnemonic: vrfin
Syntax: vD,vB
Operation: Round each of the four word floating-point elements in vB to the nearest floating-point integer and place the results into the four corresponding word elements of vD.

Name: Vector Round to Floating-Point Integer Toward Zero
Mnemonic: vrfiz
Syntax: vD,vB
Operation: Round each of the four word floating-point elements in vB toward zero and place the results into the four corresponding word elements of vD.

Name: Vector Round to Floating-Point Integer Toward Positive Infinity
Mnemonic: vrfip
Syntax: vD,vB
Operation: Round each of the four word floating-point elements in vB toward +infinity and place the results into the four corresponding word elements of vD.
4.2.2.5 Floating-Point Compare Instructions

This section describes the floating-point unordered compare instructions.

All AltiVec floating-point compare instructions (vcmpeqfp, vcmpgtfp, vcmpgefp, and vcmpbfp) return false if either operand is a NaN. NaNs compare as not equal-to, not greater-than, not greater-than-or-equal-to, and not-in-bounds to everything, including themselves.

Compares always return a Boolean mask (true = 0xFFFF_FFFF, false = 0x0000_0000) and never return a NaN. The vcmpeqfp instruction is recommended as the isNaN(vX) test: compare vX with itself, and the elements that compare false are NaNs. No explicit unordered compare instructions or traps are provided. However, the greater-than-or-equal-to predicate (≥) (vcmpgefp) is provided, in addition to the > and = predicates available for integer comparison, specifically to enable IEEE unordered comparison that would not be possible with just the > and = predicates. Table 4-10 lists the six common mathematical predicates and how they would be realized in AltiVec code.

Table 4-9. Floating-Point Rounding and Conversion Instructions (continued)

Name: Vector Round to Floating-Point Integer Toward Minus Infinity
Mnemonic: vrfim
Syntax: vD,vB
Operation: Round each of the four word floating-point elements in vB toward -infinity and place the results into the four corresponding word elements of vD.

Name: Vector Convert from Unsigned Fixed-Point Word
Mnemonic: vcfux
Syntax: vD,vB,UIMM
Operation: Convert each of the four unsigned fixed-point integer word elements in vB to the nearest single-precision value. Divide the result by 2^UIMM and place it into the corresponding word element of vD.

Name: Vector Convert from Signed Fixed-Point Word
Mnemonic: vcfsx
Syntax: vD,vB,UIMM
Operation: Convert each of the four signed fixed-point integer word elements in vB to the nearest single-precision value. Divide the result by 2^UIMM and place it into the corresponding word element of vD.

Name: Vector Convert to Unsigned Fixed-Point Word Saturate
Mnemonic: vctuxs
Syntax: vD,vB,UIMM
Operation: Multiply each of the four single-precision word elements in vB by 2^UIMM.
  The products are converted to unsigned fixed-point integers using the round-toward-zero mode. If an intermediate result is > 2^32 - 1, saturate to 2^32 - 1; if it is < 0, saturate to 0. Place the unsigned integer results into the corresponding word elements of vD.

Name: Vector Convert to Signed Fixed-Point Word Saturate
Mnemonic: vctsxs
Syntax: vD,vB,UIMM
Operation: Multiply each of the four single-precision word elements in vB by 2^UIMM. The products are converted to signed fixed-point integers using the round-toward-zero mode. If an intermediate result is > 2^31 - 1, saturate to 2^31 - 1; if it is < -2^31, saturate to -2^31. Place the signed integer results into the corresponding word elements of vD.
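The scale, truncate, and saturate steps of the conversion instructions in Table 4-9 can be sketched per element as follows. This is a Python model with hypothetical function names; the saturation limits for the signed case assume the 32-bit two's-complement range (-2^31 to 2^31 - 1), and NaN inputs are deliberately left outside the sketch.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def vctsxs(x, uimm):
    """Model of one vctsxs element: scale by 2**uimm, truncate, saturate."""
    scaled = x * 2.0 ** uimm
    if scaled > 2 ** 31 - 1:
        return 2 ** 31 - 1
    if scaled < -(2 ** 31):
        return -(2 ** 31)
    return int(scaled)          # int() truncates toward zero


def vctuxs(x, uimm):
    """Model of one vctuxs element: unsigned result in 0 .. 2**32 - 1."""
    scaled = x * 2.0 ** uimm
    if scaled > 2 ** 32 - 1:
        return 2 ** 32 - 1
    if scaled < 0:
        return 0
    return int(scaled)


def vcfsx(i, uimm):
    """Model of one vcfsx element: nearest single, then divide by 2**uimm."""
    return f32(float(i)) / 2.0 ** uimm
```

With UIMM = 2 the pair `vcfsx` and `vctsxs` treats the integer as a fixed-point value with two fraction bits: `vcfsx(7, 2)` gives 1.75 and `vctsxs(1.75, 2)` gives back 7.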
Table 4-10. Common Mathematical Predicates

                                                        Relations
Case   Mathematical Predicate   AltiVec Realization     a>b   a<b   a=b   ?
1      a = b                    a = b                   F     F     T     F
2      a ≠ b (?<>)              ¬(a = b)                T     T     F     T
3      a > b                    a > b                   T     F     F     F
4      a < b                    b > a                   F     T     F     F
5      a ≥ b                    ¬(b > a)                T     F     T     *T
6      a ≤ b                    ¬(a > b)                F     T     T     *T
5a     a ≥ b                    a ≥ b                   T     F     T     F
6a     a ≤ b                    b ≥ a                   F     T     T     F

* Note: Cases 5 and 6 implemented with greater-than (vcmpgtfp and vnor) would not yield the correct IEEE result when the relation is unordered.

Table 4-11 shows the remaining eight useful predicates and how they might be realized in AltiVec code.

Table 4-11. Other Useful Predicates

                                                        Relations
Case   Predicate    AltiVec Realization                 a>b   a<b   a=b   ?
7      a ? b        ¬((a = b) ∨ (b > a) ∨ (a > b))      F     F     F     T
8      a <> b       (a ≥ b) ⊕ (b ≥ a)                   T     T     F     F
9      a <=> b      (a ≥ b) ∨ (b ≥ a)                   T     T     T     F
10     a ?> b       ¬(b ≥ a)                            T     F     F     T
11     a ?>= b      ¬(b > a)                            T     F     T     T
12     a ?< b       ¬(a ≥ b)                            F     T     F     T
13     a ?<= b      ¬(a > b)                            F     T     T     T
14     a ?= b       ¬((a > b) ∨ (b > a))                F     F     T     T

The vector floating-point compare instructions compare the elements in two vector registers word-by-word, interpreting the elements as single-precision numbers. With the exception of the vector compare bounds floating-point (vcmpbfp) instruction, they set the target vector register, and CR[6] if Rc = 1, in the same manner as do the vector integer compare instructions.

The vector compare bounds floating-point (vcmpbfp) instruction sets the target vector register, and CR[6] if Rc = 1, to indicate whether the elements in vA are within the bounds specified by the corresponding element in vB, as explained in the instruction description. A
single-precision value x is said to be within the bounds specified by a single-precision value y if (-y ≤ x ≤ y).

The floating-point compare instructions are summarized in Table 4-12.

4.2.2.6 Floating-Point Estimate Instructions

The floating-point estimate instructions are summarized in Table 4-13.

Table 4-12. Floating-Point Compare Instructions

Name: Vector Compare Greater Than Floating-Point [Record]
Mnemonic: vcmpgtfp[.]
Syntax: vD,vA,vB
Operation: Compare each of the four single-precision word elements in vA to the corresponding four single-precision word elements in vB. For each element, if vA > vB, set the corresponding element in vD to all 1s; otherwise clear the element in vD to all 0s.
  If the record bit is set (Rc = 1) in the vector compare instruction:
    if vD == 1s (all elements true), CR6[0] is set;
    if vD == 0s (all elements false), CR6[2] is set.

Name: Vector Compare Equal To Floating-Point [Record]
Mnemonic: vcmpeqfp[.]
Syntax: vD,vA,vB
Operation: Compare each of the four single-precision word elements in vA to the corresponding four single-precision word elements in vB. For each element, if vA = vB, set the corresponding element in vD to all 1s; otherwise clear the element in vD to all 0s.
  If the record bit is set (Rc = 1) in the vector compare instruction:
    if vD == 1s (all elements true), CR6[0] is set;
    if vD == 0s (all elements false), CR6[2] is set.

Name: Vector Compare Greater Than or Equal To Floating-Point [Record]
Mnemonic: vcmpgefp[.]
Syntax: vD,vA,vB
Operation: Compare each of the four single-precision word elements in vA to the corresponding four single-precision word elements in vB.
  For each element, if vA >= vB, set the corresponding element in vD to all 1s; otherwise clear the element in vD to all 0s.
  If the record bit is set (Rc = 1) in the vector compare instruction:
    if vD == 1s (all elements true), CR6[0] is set;
    if vD == 0s (all elements false), CR6[2] is set.

Name: Vector Compare Bounds Floating-Point [Record]
Mnemonic: vcmpbfp[.]
Syntax: vD,vA,vB
Operation: Compare each of the four single-precision word elements in vA to the corresponding single-precision word elements in vB. A 2-bit value is formed that indicates whether the element in vA is within the bounds specified by the element in vB, as follows. Bit 0 of the two-bit value is cleared if the element in vA is <= the element in vB, and is set otherwise. Bit 1 of the two-bit value is cleared if the element in vA is >= the negation of the element in vB, and is set otherwise. The two-bit value is placed into the high-order two bits of the corresponding word element of vD and the remaining bits of the element are cleared to 0.
  If Rc = 1, CR6[2] is set when all four elements in vA are within the bounds specified by the corresponding element in vB.
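The predicate realizations of Tables 4-10 and 4-11 and the vcmpbfp result encoding can be checked with a scalar sketch. It is a Python model with hypothetical names that returns booleans instead of 32-bit masks; Python's own comparisons already return false when an operand is a NaN, which is the behavior the text describes for the compare instructions.

```python
NAN = float("nan")


def vcmpeqfp(a, b):
    return a == b


def vcmpgtfp(a, b):
    return a > b


def vcmpgefp(a, b):
    return a >= b


def is_nan(x):
    """The isNaN test: an element that is not equal to itself is a NaN."""
    return not vcmpeqfp(x, x)


def ge_ieee(a, b):
    """Case 5a: a >= b realized with vcmpgefp."""
    return vcmpgefp(a, b)


def ge_via_gt(a, b):
    """Case 5: not (b > a); gives true for the unordered relation."""
    return not vcmpgtfp(b, a)


def unordered(a, b):
    """Case 7: true only when a and b are unordered."""
    return not (vcmpeqfp(a, b) or vcmpgtfp(b, a) or vcmpgtfp(a, b))


def relations(pred):
    """Evaluate pred for a>b, a<b, a=b, and unordered, e.g. 'TFTF'."""
    cases = [(2.0, 1.0), (1.0, 2.0), (1.0, 1.0), (NAN, 1.0)]
    return "".join("T" if pred(a, b) else "F" for a, b in cases)


def vcmpbfp(a, b):
    """Model of one vcmpbfp element: 2-bit result in the high-order bits."""
    bit0 = 0 if a <= b else 1       # set when a is above the upper bound
    bit1 = 0 if a >= -b else 1      # set when a is below the lower bound
    return (bit0 << 31) | (bit1 << 30)
```

`relations(ge_ieee)` and `relations(ge_via_gt)` differ only in the last (unordered) column, which is the difference flagged by the note under Table 4-10.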
4.2.3 Load and Store Instructions

Only very basic load and store operations are provided in the AltiVec ISA. This keeps the circuitry in the memory path fast so that the latency of memory operations is low. Instead, a powerful set of field manipulation instructions is provided to manipulate data into the desired alignment and arrangement after the data has been brought into the vector registers.

Load vector indexed (lvx, lvxl) and store vector indexed (stvx, stvxl) instructions transfer an aligned quad-word vector between memory and vector registers. Load vector element indexed (lvebx, lvehx, lvewx) and store vector element indexed (stvebx, stvehx, stvewx) instructions transfer byte, half-word, and word scalar elements between memory and vector registers.

All vector loads and vector stores use the index (rA|0 + rB) addressing mode to specify the target memory address. The AltiVec ISA does not provide any update forms. An lvebx, lvehx, or lvewx instruction transfers a scalar data element from memory into the destination vector register, leaving other elements in the vector with boundedly-undefined values. A stvebx, stvehx, or stvewx instruction transfers a scalar data element from the source vector register to memory, leaving other elements in the quad word unchanged. No data alignment occurs; that is, all scalar data elements are transferred directly on their natural memory byte-lanes to or from the corresponding element in the vector register.

Quad-word memory accesses made by lvx, lvxl, stvx, and stvxl instructions are not guaranteed to be atomic.

Direct-store segments (T = 1) are not supported by the AltiVec ISA. Any vector load or store that attempts to access a direct-store segment will cause a DSI exception.

Table 4-13.
Floating-Point Estimate Instructions

Name: Vector Reciprocal Estimate Floating-Point
Mnemonic: vrefp
Syntax: vD,vB
Operation: Place estimates of the reciprocal of each of the four word floating-point source elements in vB in the corresponding four word elements in vD.

Name: Vector Reciprocal Square Root Estimate Floating-Point
Mnemonic: vrsqrtefp
Syntax: vD,vB
Operation: Place estimates of the reciprocal square-root of each of the four word source elements in vB in the corresponding four word elements in vD.

Name: Vector Log2 Estimate Floating-Point
Mnemonic: vlogefp
Syntax: vD,vB
Operation: Place estimates of the base 2 logarithm of each of the four word source elements in vB in the corresponding four word elements in vD.

Name: Vector 2 Raised to the Exponent Estimate Floating-Point
Mnemonic: vexptefp
Syntax: vD,vB
Operation: Place estimates of 2 raised to the power of each of the four word source elements in vB in the corresponding four word elements in vD.
4.2.3.1 Alignment

All memory references must be size aligned. If a vector load or store address is not properly size aligned, the suitable number of least significant bits are ignored, and a size-aligned transfer occurs instead. Data alignment must be performed by software after the data has been brought into the registers.

No assistance is provided for aligning individual scalar elements that are not aligned on their natural size boundary. However, assistance is provided for justifying non-size-aligned vectors. This is provided through the load vector for shift left (lvsl) and load vector for shift right (lvsr) instructions, which compute the proper vector permute (vperm) control vector from the misaligned memory address. For details on how to use these instructions to align data, see Section 3.1.6, "Quad-Word Data Alignment."

The lvx, lvxl, stvx, and stvxl instructions can be used to move data, not just multimedia data, in PowerPC environments. Therefore, because vector loads and stores are size-aligned, care should be taken to align data on even quad-word boundaries for maximum performance.

4.2.3.2 Load and Store Address Generation

All AltiVec load and store instructions generate effective addresses using register indirect with index addressing mode, which causes the contents of two GPRs (specified as operands rA and rB) to be added in the generation of the effective address (EA). A zero in place of the rA operand causes a zero to be added to the value specified by rB. The option to specify rA or 0 is shown in the instruction descriptions as (rA|0). If the sum (rA|0) + (rB) is misaligned for a half-word, word, or quad-word access, the low-order bits of the effective address are ANDed with zeros to align the address to the operand-size boundary, as summarized in Table 4-14.

Table 4-14.
effective address alignment operand effective address bit setting indexed half word ea[63] 0b0 indexed word ea[62?3] 0b00 indexed quad word ea[60?3] 0b0000 f r e e s c a l e s e m i c o n d u c t o r , i freescale semiconductor, inc. f o r m o r e i n f o r m a t i o n o n t h i s p r o d u c t , g o t o : w w w . f r e e s c a l e . c o m n c . . .
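As an illustration of the address generation and alignment rule above, the following is a minimal Python sketch, not an implementation taken from this manual. It assumes a 64-bit effective address (bits numbered 0 to 63, as in Table 4-14); the function name, the `access` keys, and the `ra_field_is_zero` flag are hypothetical names chosen for this example.

```python
MASK64 = (1 << 64) - 1

# Access size in bytes. Clearing (size - 1) in the EA matches Table 4-14:
# half word clears EA[63], word clears EA[62-63], quad word clears EA[60-63].
ACCESS_BYTES = {"byte": 1, "half": 2, "word": 4, "quad": 16}


def effective_address(ra, rb, access, ra_field_is_zero=False):
    """Model (rA|0) + (rB), then force the sum to the access-size boundary."""
    base = 0 if ra_field_is_zero else ra
    ea = (base + rb) & MASK64
    return ea & ~(ACCESS_BYTES[access] - 1) & MASK64


if __name__ == "__main__":
    for access in ("byte", "half", "word", "quad"):
        print(access, hex(effective_address(0x1000, 0x2B, access)))
```

With rA = 0x1000 and rB = 0x2B, the unaligned sum 0x102B is used as is for a byte access and is truncated to 0x102A, 0x1028, and 0x1020 for half-word, word, and quad-word accesses respectively.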
Figure 4-1 shows how an effective address is generated when using register indirect with index addressing.

[Figure 4-1. Register Indirect with Index Addressing for Loads/Stores. Instruction encoding: opcode (bits 0-5), vD/vS (bits 6-10), rA (bits 11-15), rB (bits 16-20), subopcode (bits 21-30), reserved 0 (bit 31). If rA = 0, the value 0 is used; otherwise the contents of GPR (rA) are used. That value is added to the contents of GPR (rB) to form the effective address (bits 0-63), which is boundary aligned and passed to the memory interface. Data is loaded into, or stored from, VR (vD).]

4.2.3.3 Vector Load Instructions

For vector load instructions, the byte, half word, word, or quad word addressed by the EA (effective address) is loaded into vD. The default byte and bit ordering is big-endian as in the PowerPC architecture; see Section 3.1.2, "AltiVec Byte Ordering," for information about little-endian byte ordering.
Table 4-15 summarizes the vector load instructions.

Table 4-15. Integer Load Instructions

Name: Load Vector Element Integer Indexed [b,h,w]
Mnemonic: lvebx, lvehx, lvewx
Syntax: vD,rA,rB
Operation: The EA is the sum (rA|0) + (rB). Load the byte, half word, or word in memory addressed by the EA into the low-order bits of vD. The remaining bits in vD are set to boundedly undefined values. Because memory accesses must stay aligned, the EA is forced to the alignment of the element. For b (byte, integer length = 8 bits = 1 byte), no EA bits are cleared. For h (half word, integer length = 16 bits = 2 bytes), EA[63] is set to 0b0. For w (word, integer length = 32 bits = 4 bytes), EA[62-63] is set to 0b00.

Name: Load Vector Indexed
Mnemonic: lvx
Syntax: vD,rA,rB
Operation: The EA is the sum (rA|0) + (rB). Load the quad word in memory addressed by the EA into vD. Because memory accesses must stay aligned, the EA is forced to quad-word alignment: for a quad word (integer length = 128 bits = 16 bytes), EA[60-63] is set to 0b0000. LRU = 0. If the processor is in little-endian mode, load the double word in memory addressed by EA into vD[64-127] and load the double word in memory addressed by EA+8 into vD[0-63].

Name: Load Vector Indexed LRU
Mnemonic: lvxl
Syntax: vD,rA,rB
Operation: The EA is the sum (rA|0) + (rB). Load the quad word in memory addressed by the EA into vD. For a quad word (integer length = 128 bits = 16 bytes), EA[60-63] is set to 0b0000. LRU = 1 (least recently used) hints that the quad word in the memory addressed by EA will probably not be needed again by the program in the near future. If the processor is in little-endian mode, load the double word in memory addressed by EA into vD[64-127] and load the double word in memory addressed by EA+8 into vD[0-63].

The lvsl and lvsr instructions can be used to create the permute control vector to be used by a subsequent vperm instruction. Let X and Y be the contents of vA and vB specified by vperm. The control vector created by lvsl causes the vperm to select the high-order 16 bytes of the result of shifting the 32-byte value X || Y left by sh bytes (sh = the value in EA[60-63]). The control vector created by lvsr causes the vperm to select the low-order 16 bytes of the result of shifting X || Y right by sh bytes.

These instructions can also be used to rotate or shift the contents of a vector register left (lvsl) or right (lvsr) by sh bytes. The control vectors produced by lvsl for each sh value are shown in Table 4-17, and those produced by lvsr are shown in Table 4-18. For rotating, the vector register to be rotated should be specified as both the vA and the vB register for vperm. For shifting left, the vB register for vperm should be a register containing all zeros and vA should contain the value to be shifted, and vice versa for shifting right. For further examples on how to align the data, see Section 3.1.6, "Quad-Word Data Alignment."

The default byte and bit ordering is big-endian as in the PowerPC architecture; see Section 3.1.2.2, "Little-Endian Byte Ordering," for information about little-endian byte ordering.

Table 4-16 summarizes the vector alignment instructions.

Table 4-16. Vector Load Instructions Supporting Alignment

Name: Load Vector for Shift Left
Mnemonic: lvsl
Syntax: vD,rA,rB
Operation: The EA is the sum (rA|0) + (rB). With sh = EA[60-63], place the corresponding value from Table 4-17 in vD.

Name: Load Vector for Shift Right
Mnemonic: lvsr
Syntax: vD,rA,rB
Operation: The EA is the sum (rA|0) + (rB). With sh = EA[60-63], place the corresponding value from Table 4-18 in vD.

Table 4-17. Shift Values for lvsl Instruction

Shift (sh)   vD[0-127]
0x0          0x000102030405060708090a0b0c0d0e0f
0x1          0x0102030405060708090a0b0c0d0e0f10
0x2          0x02030405060708090a0b0c0d0e0f1011
0x3          0x030405060708090a0b0c0d0e0f101112
0x4          0x0405060708090a0b0c0d0e0f10111213
0x5          0x05060708090a0b0c0d0e0f1011121314
0x6          0x060708090a0b0c0d0e0f101112131415
0x7          0x0708090a0b0c0d0e0f10111213141516
0x8          0x08090a0b0c0d0e0f1011121314151617
0x9          0x090a0b0c0d0e0f101112131415161718
0xa          0x0a0b0c0d0e0f10111213141516171819
0xb          0x0b0c0d0e0f101112131415161718191a
0xc          0x0c0d0e0f101112131415161718191a1b
0xd          0x0d0e0f101112131415161718191a1b1c
0xe          0x0e0f101112131415161718191a1b1c1d
0xf          0x0f101112131415161718191a1b1c1d1e

Table 4-18. Shift Values for lvsr Instruction

Shift (sh)   vD[0-127]
0x0          0x101112131415161718191a1b1c1d1e1f
0x1          0x0f101112131415161718191a1b1c1d1e
0x2          0x0e0f101112131415161718191a1b1c1d
0x3          0x0d0e0f101112131415161718191a1b1c
0x4          0x0c0d0e0f101112131415161718191a1b
0x5          0x0b0c0d0e0f101112131415161718191a
0x6          0x0a0b0c0d0e0f10111213141516171819
0x7          0x090a0b0c0d0e0f101112131415161718
0x8          0x08090a0b0c0d0e0f1011121314151617
0x9          0x0708090a0b0c0d0e0f10111213141516
0xa          0x060708090a0b0c0d0e0f101112131415
0xb          0x05060708090a0b0c0d0e0f1011121314
0xc          0x0405060708090a0b0c0d0e0f10111213
0xd          0x030405060708090a0b0c0d0e0f101112
0xe          0x02030405060708090a0b0c0d0e0f1011
0xf          0x0102030405060708090a0b0c0d0e0f10

4.2.3.4 Vector Store Instructions

For vector store instructions, the contents of the vector register used as a source (vS) are stored into the byte, half word, word, or quad word in memory addressed by the effective address (EA). Table 4-19 provides a summary of the vector store instructions.

Table 4-19. Integer Store Instructions

Name: Store Vector Element Integer Indexed [b,h,w]
Mnemonic: stvebx, stvehx, stvewx
Syntax: vS,rA,rB
Operation: The EA is the sum (rA|0) + (rB). Store the contents of the low-order bits of vS into the integer in memory addressed by the EA. Because memory accesses must stay aligned, the EA is forced to the alignment of the element. For b (byte, integer length = 8 bits = 1 byte), no EA bits are cleared. For h (half word, integer length = 16 bits = 2 bytes), EA[63] is set to 0b0. For w (word, integer length = 32 bits = 4 bytes), EA[62-63] is set to 0b00.
Table 4-19. Integer Store Instructions (continued)

Name: Store Vector Indexed
Mnemonic: stvx
Syntax: vS,rA,rB
Operation: The EA is the sum (rA|0) + (rB). Store the contents of vS into the quad word in memory addressed by the EA. For a quad word (integer length = 128 bits = 16 bytes), EA[60-63] is set to 0b0000. LRU = 0. If the processor is in little-endian mode, store the contents of vS[64-127] into the double word in memory addressed by EA, and store the contents of vS[0-63] into the double word in memory addressed by EA+8.

Name: Store Vector Indexed LRU
Mnemonic: stvxl
Syntax: vS,rA,rB
Operation: The EA is the sum (rA|0) + (rB). Store the contents of vS into the quad word in memory addressed by the EA. For a quad word (integer length = 128 bits = 16 bytes), EA[60-63] is set to 0b0000. LRU = 1 (least recently used) hints that the quad word in the memory addressed by EA will probably not be needed again by the program in the near future. If the processor is in little-endian mode, store the contents of vS[64-127] into the double word in memory addressed by EA, and store the contents of vS[0-63] into the double word in memory addressed by EA+8.

4.2.4 Control Flow

AltiVec instructions can be freely intermixed with existing PowerPC instructions to form a complete program. AltiVec instructions provide a vector compare and select mechanism to implement conditional execution as a way to control data flow in AltiVec programs. In addition, AltiVec vector compare instructions can update the condition register, thus providing the communication from AltiVec execution units to PowerPC branch instructions necessary to modify program flow based on vector data.

4.2.5 Vector Permutation and Formatting Instructions

Vector pack, unpack, merge, splat, permute, and select instructions can be used to accelerate various vector math and vector formatting operations. Details of the various instructions follow.

4.2.5.1 Vector Pack Instructions

Half-word vector pack instructions (vpkuhum, vpkuhus, vpkshus, vpkshss) truncate the sixteen half words from two concatenated source operands, producing a single result of sixteen bytes (quad word) using either modulo (2^8), 8-bit signed-saturation, or 8-bit unsigned-saturation to perform the truncation. Similarly, word vector pack instructions (vpkuwum, vpkuwus, vpkswus, and vpkswss) truncate the eight words from two concatenated source operands, producing a single result of eight half words using modulo (2^16), 16-bit signed-saturation, or 16-bit unsigned-saturation to perform the truncation.

One special form, the vector pack pixel (vpkpx) instruction, packs eight 32-bit (8/8/8/8) pixels from two concatenated source operands into a single result of eight 16-bit 1/5/5/5 RGB pixels. The least significant bit of the first 8-bit element becomes the 1-bit field, and each of the three 8-bit R, G, and B fields is reduced to 5 bits by ignoring the 3 LSBs. Table 4-20 describes the vector pack instructions.

Table 4-20. Vector Pack Instructions

Name: Vector Pack Unsigned Integer [h,w] Unsigned Modulo
Mnemonic: vpkuhum, vpkuwum
Syntax: vD,vA,vB
Operation: Concatenate the low-order unsigned integers of vA and the low-order unsigned integers of vB and place into vD using unsigned modulo arithmetic. vA is placed in the lower order double word of vD and vB is placed into the higher order double word of vD. For h (half word, integer length = 16 bits = 2 bytes), eight unsigned integers, in other words the 8 low-order bytes of the half words from vA and vB. For w (word, integer length = 32 bits = 4 bytes), four unsigned integers, in other words the 4 low-order half words of the words from vA and vB.

Name: Vector Pack Unsigned Integer [h,w] Unsigned Saturate
Mnemonic: vpkuhus, vpkuwus
Syntax: vD,vA,vB
Operation: Concatenate the low-order unsigned integers of vA and the low-order unsigned integers of vB and place into vD using unsigned saturate clamping mode. vA is placed in the lower order double word of vD and vB is placed into the higher order double word of vD. For h (half word, integer length = 16 bits = 2 bytes), eight unsigned integers, in other words the 8 low-order bytes of the half words from vA and vB. For w (word, integer length = 32 bits = 4 bytes), four unsigned integers, in other words the 4 low-order half words of the words from vA and vB.

Name: Vector Pack Signed Integer [h,w] Unsigned Saturate
Mnemonic: vpkshus, vpkswus
Syntax: vD,vA,vB
Operation: Concatenate the low-order signed integers of vA and the low-order signed integers of vB and place into vD using unsigned saturate clamping mode. vA is placed in the lower order double word of vD and vB is placed into the higher order double word of vD. For h (half word, integer length = 16 bits = 2 bytes), eight signed integers, in other words the 8 low-order bytes of the half words from vA and vB. For w (word, integer length = 32 bits = 4 bytes), four signed integers, in other words the 4 low-order half words of the words from vA and vB.

Name: Vector Pack Signed Integer [h,w] Signed Saturate
Mnemonic: vpkshss, vpkswss
Syntax: vD,vA,vB
Operation: Concatenate the low-order signed integers of vA and the low-order signed integers of vB and place into vD using signed saturate clamping mode. vA is placed in the lower order double word of vD and vB is placed into the higher order double word of vD. For h (half word, integer length = 16 bits = 2 bytes), eight signed integers, in other words the 8 low-order bytes of the half words from vA and vB. For w (word, integer length = 32 bits = 4 bytes), four signed integers, in other words the 4 low-order half words of the words from vA and vB.
Table 4-20. Vector Pack Instructions (continued)

Name: Vector Pack Pixel
Mnemonic: vpkpx
Syntax: vD,vA,vB
Operation: Each word element in vA and vB is packed to 16 bits and the half word is placed into vD. Each word from vA and vB is packed to 16 bits in the following order: [bit 7 of the first byte (bit 7 of the word)] [bits 0-4 of the second byte (bits 8-12 of the word)] [bits 0-4 of the third byte (bits 16-20 of the word)] [bits 0-4 of the fourth byte (bits 24-28 of the word)]. vA half words are placed in the lower order double word of vD and vB half words are placed into the higher order double word of vD. For h (half word, integer length = 16 bits = 2 bytes), eight signed integers, in other words the 8 low-order bytes of the half words from vA and vB. For w (word, integer length = 32 bits = 4 bytes), four signed integers, in other words the 4 low-order half words of the words from vA and vB.

4.2.5.2 Vector Unpack Instructions

Byte vector unpack instructions unpack the 8 low bytes (or 8 high bytes) of one source operand into 8 half words using sign extension to fill the MSBs. Half-word vector unpack instructions unpack the 4 low half words (or 4 high half words) of one source operand into 4 words using sign extension to fill the MSBs.

A special-purpose form of vector unpack is provided for 1/5/5/5 RGB pixels: the vector unpack low pixel (vupklpx) and the vector unpack high pixel (vupkhpx) instructions. The 1/5/5/5 pixel vector unpack instructions unpack the four low 1/5/5/5 pixels (or four high 1/5/5/5 pixels) into four 32-bit (8/8/8/8) pixels. The 1-bit element in each pixel is sign extended to 8 bits, and the 5-bit R, G, and B elements are each zero extended to 8 bits. Table 4-21 describes the unpack instructions.
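The pack truncation modes of Section 4.2.5.1 and the pixel formats of Sections 4.2.5.1 and 4.2.5.2 can be illustrated with a small behavioral model. The Python sketch below is an assumption-laden illustration, not code from this manual: vectors are plain lists, element 0 is taken to be the leftmost (big-endian) element, the packed result is assumed to hold the vA elements followed by the vB elements, and all function names are hypothetical.

```python
def saturate(value, low, high):
    """Clamp value to the inclusive range [low, high]."""
    return max(low, min(high, value))


def pack_signed_saturate(va, vb, bits):
    """vpkshss/vpkswss-style pack: clamp signed elements to `bits` bits."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return [saturate(x, low, high) for x in va + vb]


def pack_modulo(va, vb, bits):
    """vpkuhum/vpkuwum-style pack: keep only the low-order `bits` bits."""
    return [x & ((1 << bits) - 1) for x in va + vb]


def pack_pixel(word):
    """vpkpx on one 32-bit 8/8/8/8 pixel, giving a 16-bit 1/5/5/5 pixel."""
    a = (word >> 24) & 0x1    # least significant bit of the first byte
    r = (word >> 19) & 0x1F   # upper 5 bits of the second byte
    g = (word >> 11) & 0x1F   # upper 5 bits of the third byte
    b = (word >> 3) & 0x1F    # upper 5 bits of the fourth byte
    return (a << 15) | (r << 10) | (g << 5) | b


def unpack_pixel(half):
    """vupkhpx/vupklpx on one 1/5/5/5 pixel, giving a 32-bit 8/8/8/8 pixel."""
    a = 0xFF if half & 0x8000 else 0x00   # 1-bit field, sign extended
    r = (half >> 10) & 0x1F               # 5-bit fields, zero extended
    g = (half >> 5) & 0x1F
    b = half & 0x1F
    return (a << 24) | (r << 16) | (g << 8) | b
```

Because the unpack zero-extends each 5-bit color field while the pack keeps the upper 5 bits of each 8-bit field, packing an unpacked pixel does not in general reproduce the original half word.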
Table 4-21. Vector Unpack Instructions

Name: Vector Unpack High Signed Integer [b,h]
Mnemonic: vupkhsb, vupkhsh
Syntax: vD,vB
Operation: Each signed integer element in the high-order double word of vB is sign extended to fill the MSBs in a signed integer and then is placed into vD. For b (byte, integer length = 8 bits = 1 byte), eight signed bytes from the high-order double word of vB are unpacked and sign extended to 8 half words in vD. For h (half word, integer length = 16 bits = 2 bytes), four signed half words from the high-order double word of vB are unpacked and sign extended to 4 words in vD.

Name: Vector Unpack High Pixel
Mnemonic: vupkhpx
Syntax: vD,vB
Operation: Each half-word element in the high-order double word of vB is unpacked to produce a 32-bit word that is then placed in the same order into vD. A half-word element is unpacked to 32 bits by concatenating, in order, the results of the following operations: sign-extend bit 0 of the half word to 8 bits; zero-extend bits 1-5 of the half word to 8 bits; zero-extend bits 6-10 of the half word to 8 bits; zero-extend bits 11-15 of the half word to 8 bits.

Name: Vector Unpack Low Signed Integer [b,h]
Mnemonic: vupklsb, vupklsh
Syntax: vD,vB
Operation: Each signed integer element in the low-order double word of vB is sign extended to fill the MSBs in a signed integer and then is placed into vD. For b (byte, integer length = 8 bits = 1 byte), eight signed bytes from the low-order double word of vB are unpacked and sign extended to 8 half words in vD. For h (half word, integer length = 16 bits = 2 bytes), four signed half words from the low-order double word of vB are unpacked and sign extended to 4 words in vD.

Name: Vector Unpack Low Pixel
Mnemonic: vupklpx
Syntax: vD,vB
Operation: Each half-word element in the low-order double word of vB is unpacked to produce a 32-bit word that is then placed in the same order into vD. A half-word element is unpacked to 32 bits by concatenating, in order, the results of the following operations: sign-extend bit 0 of the half word to 8 bits; zero-extend bits 1-5 of the half word to 8 bits; zero-extend bits 6-10 of the half word to 8 bits; zero-extend bits 11-15 of the half word to 8 bits.

4.2.5.3 Vector Merge Instructions

Byte vector merge instructions interleave the 8 low bytes (or 8 high bytes) from two source operands, producing a result of 16 bytes. Similarly, half-word vector merge instructions interleave the 4 low half words (or 4 high half words) of two source operands, producing a result of 8 half words, and word vector merge instructions interleave the 2 low words (or 2 high words) from two source operands, producing a result of 4 words. The vector merge instructions have many uses; notable among them is efficiently transposing SIMD vectors. Table 4-22 describes the merge instructions.
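To make the transpose use of the merge instructions in Section 4.2.5.3 concrete, here is a minimal Python sketch of a 4x4 word transpose built from merge-high and merge-low steps. It is an illustration under stated assumptions rather than code from this manual: vectors are lists with element 0 leftmost, "high" is taken to mean the first half of the elements, the result is assumed to alternate vA and vB elements starting with vA, and the function names are hypothetical stand-ins for vmrghw and vmrglw.

```python
def merge_high(va, vb):
    """Interleave the first (high-order) halves of va and vb."""
    half = len(va) // 2
    out = []
    for x, y in zip(va[:half], vb[:half]):
        out += [x, y]
    return out


def merge_low(va, vb):
    """Interleave the second (low-order) halves of va and vb."""
    half = len(va) // 2
    out = []
    for x, y in zip(va[half:], vb[half:]):
        out += [x, y]
    return out


def transpose4x4(r0, r1, r2, r3):
    """Transpose four 4-word rows with eight merge operations."""
    t0 = merge_high(r0, r2)   # [r0[0], r2[0], r0[1], r2[1]]
    t1 = merge_high(r1, r3)   # [r1[0], r3[0], r1[1], r3[1]]
    t2 = merge_low(r0, r2)    # [r0[2], r2[2], r0[3], r2[3]]
    t3 = merge_low(r1, r3)    # [r1[2], r3[2], r1[3], r3[3]]
    return [merge_high(t0, t1), merge_low(t0, t1),
            merge_high(t2, t3), merge_low(t2, t3)]


if __name__ == "__main__":
    rows = [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
    for col in transpose4x4(*rows):
        print(col)
```

The first round of merges pairs rows 0/2 and 1/3; the second round then yields one column of the original matrix per result vector.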
Table 4-22. Vector Merge Instructions

Name: Vector Merge High Integer [b,h,w]
Mnemonic: vmrghb, vmrghh, vmrghw
Syntax: vD,vA,vB
Operation: Each integer element in the high-order double word of vA is placed into the low-order integer element in vD. Each integer element in the high-order double word of vB is placed into the high-order integer element in vD. For b (byte, integer length = 8 bits = 1 byte), 8 bytes from the high-order double word of vA are placed into the low-order byte of each half word in vD and 8 bytes from the high-order double word of vB are placed into the high-order byte of each half word in vD. For h (half word, integer length = 16 bits = 2 bytes), 4 half words from the high-order double word of vA are placed into the low-order half word of each word in vD and 4 half words from the high-order double word of vB are placed into the high-order half word of each word in vD. For w (word, integer length = 32 bits = 4 bytes), 2 words from the high-order double word of vA are placed into the low-order word of each double word in vD and 2 words from the high-order double word of vB are placed into the high-order word of each double word in vD.

Name: Vector Merge Low Integer [b,h,w]
Mnemonic: vmrglb, vmrglh, vmrglw
Syntax: vD,vA,vB
Operation: Each integer element in the low-order double word of vA is placed into the low-order integer element in vD. Each integer element in the low-order double word of vB is placed into the high-order integer element in vD. For b (byte, integer length = 8 bits = 1 byte), 8 bytes from the low-order double word of vA are placed into the low-order byte of each half word in vD and 8 bytes from the low-order double word of vB are placed into the high-order byte of each half word in vD. For h (half word, integer length = 16 bits = 2 bytes), 4 half words from the low-order double word of vA are placed into the low-order half word of each word in vD and 4 half words from the low-order double word of vB are placed into the high-order half word of each word in vD. For w (word, integer length = 32 bits = 4 bytes), 2 words from the low-order double word of vA are placed into the low-order word of each double word in vD and 2 words from the low-order double word of vB are placed into the high-order word of each double word in vD.

4.2.5.4 Vector Splat Instructions

The vector splat instructions can be used in preparation for performing arithmetic for which one source vector is to consist of elements that all have the same value. For example, to multiply all elements of a vector register by a constant, a vector splat instruction can be used to splat the scalar into a vector register. Vector splat instructions can also be used to move data where it is required. Likewise, when storing a scalar into an arbitrary memory location, it must be splatted into a vector register, and that register must be specified as the source of the store. This guarantees that the data appears in all possible positions of that scalar size for the store. Table 4-23 describes the vector splat instructions.
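The multiply-by-a-constant use of splat described in Section 4.2.5.4 can be sketched as follows. This Python model is only an illustration: vectors are lists with element 0 leftmost, the immediate of the splat-immediate forms is assumed to be a 5-bit signed field, and the function names are hypothetical.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def splat(vb, uimm):
    """vspltb/vsplth/vspltw-style splat: replicate element uimm of vb."""
    return [vb[uimm]] * len(vb)


def splat_immediate(simm, lanes):
    """vspltisb/vspltish/vspltisw-style splat of a sign-extended immediate."""
    return [sign_extend(simm, 5)] * lanes


def scale(vector, constants, index):
    """Multiply every element of `vector` by element `index` of `constants`."""
    factor = splat(constants, index)
    return [x * f for x, f in zip(vector, factor)]


if __name__ == "__main__":
    print(scale([1, 2, 3, 4], [7, 5, 9, 11], 1))
    print(splat_immediate(0x1F, 4))
```

Here `scale` multiplies each of the four elements by 5, the value of element 1 of the constants vector, and `splat_immediate(0x1F, 4)` yields four copies of -1 because the 5-bit pattern 0b11111 is sign extended.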
Table 4-23. Vector Splat Instructions

Name: Vector Splat Integer [b,h,w]
Mnemonic: vspltb, vsplth, vspltw
Syntax: vD,vB,UIMM
Operation: Replicate the contents of element UIMM in vB and place into each element in vD. For b (byte, integer length = 8 bits = 1 byte), each element is a byte. For h (half word, integer length = 16 bits = 2 bytes), each element is a half word. For w (word, integer length = 32 bits = 4 bytes), each element is a word.

Name: Vector Splat Immediate Signed Integer [b,h,w]
Mnemonic: vspltisb, vspltish, vspltisw
Syntax: vD,SIMM
Operation: Sign-extend the value of the SIMM field to the length of the element, replicate that value, and place it into each element in vD. For b (byte, integer length = 8 bits = 1 byte), each element is a byte. For h (half word, integer length = 16 bits = 2 bytes), each element is a half word. For w (word, integer length = 32 bits = 4 bytes), each element is a word.

4.2.5.5 Vector Permute Instruction

Permute instructions allow any byte in any two source vector registers to be directed to any byte in the destination vector. The fields in a third source operand specify from which field in the source operands the corresponding destination field will be taken. The vector permute (vperm) instruction is a very powerful one that provides many useful functions. For example, it provides a good way to perform table lookups and data alignment operations. For an example of how to use the instruction in aligning data, see Section 3.1.6, "Quad-Word Data Alignment." Table 4-24 describes the vector permute instruction.

Table 4-24. Vector Permute Instruction

Name: Vector Permute
Mnemonic: vperm
Syntax: vD,vA,vB,vC
Operation: vC specifies which bytes from vA and vB are to be copied and placed into the byte elements in vD.

4.2.5.6 Vector Select Instruction

Data flow in the vector unit can be controlled without branching by using a vector compare and the vector select (vsel) instructions. In this use, the compare result vector is used directly as a mask operand to vector select instructions. The vsel instruction selects one field from one or the other of two source operands under control of its mask operand. Use of the true/false compare result vector with select in this manner produces a two-instruction equivalent of conditional execution on a per-field basis. Table 4-25 describes the vsel instruction.
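The compare-then-select idiom of Section 4.2.5.6 can be modeled in a few lines. The Python sketch below is illustrative only: elements are unsigned 32-bit values held in lists, the compare is assumed to produce all ones for true and all zeros for false in each element, the select takes the vA bit where the mask bit is 0 and the vB bit where it is 1, and the function names are hypothetical.

```python
WORD_MASK = 0xFFFFFFFF


def compare_greater_unsigned(va, vb):
    """Per-element compare: all ones where va > vb, all zeros elsewhere."""
    return [WORD_MASK if x > y else 0 for x, y in zip(va, vb)]


def select(va, vb, vc):
    """Bitwise select: mask bit 0 takes the va bit, mask bit 1 the vb bit."""
    return [((x & ~m) | (y & m)) & WORD_MASK for x, y, m in zip(va, vb, vc)]


def elementwise_max(va, vb):
    """Branch-free maximum built from one compare and one select."""
    mask = compare_greater_unsigned(vb, va)   # true where vb is larger
    return select(va, vb, mask)


if __name__ == "__main__":
    print(elementwise_max([1, 9, 3, 7], [4, 2, 8, 7]))
```

Because the select operates bit by bit, the same function also blends arbitrary bit fields when the mask is not a pure all-ones or all-zeros compare result.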
Table 4-25. Vector Select Instruction

Name: Vector Select
Mnemonic: vsel
Syntax: vD,vA,vB,vC
Operation: For each bit, if the bit in vC equals 0b0, load vD with the corresponding bit value from vA; otherwise, if the bit in vC equals 0b1, load vD with the corresponding bit value from vB.

4.2.5.7 Vector Shift Instructions

The vector shift instructions shift the contents of a vector register or of a pair of vector registers left or right by a specified number of bytes (vslo, vsro, vsldoi) or bits (vsl, vsr). Depending on the instruction, this shift count is specified either by low-order bits of a vector register or by an immediate field in the instruction. In the former case the low-order 7 bits of the shift count register give the shift count in bits (0 <= count <= 127). Of these 7 bits, the high-order 4 bits give the number of complete bytes by which to shift and are used by vslo and vsro; the low-order 3 bits give the number of remaining bits by which to shift and are used by vsl and vsr.

There are two methods of specifying an inter-element shift or rotate of two source vector registers, extracting 16 bytes as the result vector. There is also a method for shifting a single source vector register left or right by any number of bits. Table 4-26 describes the various vector shift instructions.

Table 4-26. Vector Shift Instructions

Name: Vector Shift Left
Mnemonic: vsl
Syntax: vD,vA,vB
Operation: Shift vA left by the number of bits specified in the 3 LSBs of vB, and place the result into vD. If the vB value is invalid, the result is boundedly undefined.

Name: Vector Shift Right
Mnemonic: vsr
Syntax: vD,vA,vB
Operation: Shift vA right by the number of bits specified in the 3 LSBs of vB, and place the result into vD. If the vB value is invalid, the result is boundedly undefined.

Name: Vector Shift Left Double by Octet Immediate
Mnemonic: vsldoi
Syntax: vD,vA,vB,SH
Operation: Shift the 32-byte concatenation of vA and vB left by SH bytes, and place the leftmost 16 bytes of the result into vD. If the vB value is invalid, the default result is 0.

Name: Vector Shift Left by Octet
Mnemonic: vslo
Syntax: vD,vA,vB
Operation: Shift vA left by the number of bytes specified in bits 121-124 of vB, and place the result into vD. If the vB value is invalid, the default result is 0b000.

Name: Vector Shift Right by Octet
Mnemonic: vsro
Syntax: vD,vA,vB
Operation: Shift vA right by the number of bytes specified in bits 121-124 of vB, and place the result into vD. If the vB value is invalid, the default result is 0b000.

4.2.5.7.1 Immediate Interelement Shifts/Rotates

The vector shift left double by octet immediate (vsldoi) instruction provides the basic mechanism that can be used to provide inter-element shifts and/or rotates. This instruction is like a vperm, except that the shift count is specified as a literal in the instruction rather than as a control vector in another vector register, as is required by vperm. The result vector consists of the left-most 16 bytes of the rotated 32-byte concatenation of vA:vB, where shift (SH) is the rotate count. Table 4-27 enumerates how various shift functions can be achieved using the vsldoi instruction.

Table 4-27. Coding Various Shifts and Rotates with the vsldoi Instruction

To get this:                             Code this:
Operation                       sh       Instruction   Immediate         vA     vB
Rotate left double              0-15     vsldoi        0-15              MSV    LSV
Rotate left double              16-31    vsldoi        mod16(sh)         LSV    MSV
Rotate right double             0-15     vsldoi        16 - sh           MSV    LSV
Rotate right double             16-31    vsldoi        16 - mod16(sh)    LSV    MSV
Shift left single, zero fill    0-15     vsldoi        0-15              MSV    0x0
Shift right single, zero fill   0-15     vsldoi        16 - sh           0x0    MSV
Rotate left single              0-15     vsldoi        0-15              MSV    =vA
Rotate right single             0-15     vsldoi        16 - sh           MSV    =vA

4.2.5.7.2 Computed Interelement Shifts/Rotates

The load vector for shift left (lvsl) instruction and load vector for shift right (lvsr) instruction are supplied to assist in shifting and/or rotating vector registers by an amount determined at run time. The input specifications have the same form as the vector load and store instructions; that is, they use register indirect with index addressing mode (rA|0 + rB). This is because one of their primary purposes is to compute the permute control vector needed for the post-load and pre-store shifting required when dealing with misaligned vectors.

The lvsl instruction can be used to align a big-endian misaligned vector after loading the (aligned) vectors that contain its pieces. The lvsr instruction can be used to misalign a vector register for use in a read-modify-write sequence that will store a misaligned big-endian vector. The lvsr instruction can be used to align a little-endian misaligned vector after loading the (aligned) vectors that contain its pieces. The lvsl instruction can be used to misalign a vector register for use in a read-modify-write sequence that will store a misaligned little-endian vector.

For an example of how the lvsl instruction is used to align a vector in big-endian mode, see Section 3.1.6.1, "Accessing a Misaligned Quad Word in Big-Endian Mode." For an example of how lvsr is used to align a vector in little-endian mode, see Section 3.1.6.2, "Accessing a Misaligned Quad Word in Little-Endian Mode."
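The relationship between the immediate form (Section 4.2.5.7.1) and the computed form (Section 4.2.5.7.2) can be checked with a small byte-level model. The Python sketch below is an illustration under assumptions, not code from this manual: vectors are 16-element lists of bytes with byte 0 leftmost (big-endian), each vperm control byte is assumed to index the 32-byte concatenation of vA and vB through its low-order 5 bits, and the function names are hypothetical.

```python
def vperm(va, vb, vc):
    """Select bytes from va || vb using the low 5 bits of each control byte."""
    source = va + vb
    return [source[c & 0x1F] for c in vc]


def lvsl(ea):
    """Control vector for a left shift by sh = EA[60-63] (Table 4-17)."""
    sh = ea & 0xF
    return [sh + i for i in range(16)]


def lvsr(ea):
    """Control vector for a right shift by sh = EA[60-63] (Table 4-18)."""
    sh = ea & 0xF
    return [16 - sh + i for i in range(16)]


def vsldoi(va, vb, sh):
    """Leftmost 16 bytes of va || vb shifted left by sh bytes (0 to 15)."""
    return (va + vb)[sh:sh + 16]


def load_misaligned(memory, address):
    """Big-endian misaligned 16-byte load from two aligned loads and a vperm."""
    base = address & ~0xF
    low = memory[base:base + 16]          # aligned quad word holding the start
    high = memory[base + 16:base + 32]    # next aligned quad word
    return vperm(low, high, lvsl(address))


if __name__ == "__main__":
    memory = list(range(48))
    print(load_misaligned(memory, 5))
```

With a run-time address, lvsl followed by vperm extracts the same 16 bytes that vsldoi would extract for a shift count known at assembly time. Passing the same register as both sources gives a rotate, and passing a zero vector as one source gives a zero-filled shift, as in Table 4-27.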
motorola chapter 4. addressing modes and instruction set summary 4-39 altivec uisa instructions 4.2.5.7.3 variable interelement shifts a vector register may be shifted left or right by a number of bits speci?d in a vector register. this operation is supported with four instructions, two for right shift and two for left shift. the vector shift left by octet ( vslo ) and vector shift right by octet ( vsro ) instructions shift a vector register from 0 to 15 bytes as speci?d in bits 121?24 of another vector register. the vector shift left ( vsl ) and vector shift right ( vsr ) instructions shift a vector register from 0 to 7 bits as speci?d in another vector register (the shift count must be speci?d in the three lsbs of each byte in the vector and must be identical in all bytes or the result is boundedly unde?ed). in all of these instructions, zeros are shifted into vacated element and bit positions. used sequentially with the same shift-count vector register, these instructions will shift a vector register left or right from 0 to 127 bits as speci?d in bits 121?27 of the shift-count vector register. for example: vslo vz, vx, vy vspltb vy, vy, 15 vsl vz, vz, vy will shift v x by the number of bits speci?d in v y and place the results in v z. with these instructions a full double-register shift can be performed in seven instructions. the following code will shift v w|| v x left by the number of bits speci?d in v y placing the result in v z: vslo t1, vw, vy ; shift the most significant. register left vspltb vy, vy, 15 vsl t1, t1, vy vsububm vy, v0, vy ; adjust count for right shift (v0=0) vsro t2, vx, vy ; right shift least sign. register vsr t2, t2, vy vor vz, t1, t2 ; merge to get the final result 4.2.6 processor control instructions?isa processor control instructions are used to read from and write to the powerpc condition register (cr), machine state register (msr), and special-purpose registers (sprs). 
See Chapter 4, "Addressing Mode and Instruction Set Summary," in the Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture, for information about the instructions used for reading from and writing to the MSR and SPRs.
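Returning briefly to the variable interelement shifts of Section 4.2.5.7.3: the three-instruction and seven-instruction sequences given there can be checked with a small integer model. This is a sketch under stated assumptions: each vector register is modelled as a 128-bit Python integer, the octet count is taken from bits 121–124 and the bit count from bits 125–127 of the count register, and the helper names simply mirror the mnemonics (they are hypothetical, not a real library).

```python
# Illustrative model only; vectors are 128-bit integers, bit 127 is the LSB.
MASK128 = (1 << 128) - 1


def vslo(va, vb):
    return (va << (8 * ((vb >> 3) & 0xF))) & MASK128   # shift left by octets


def vsro(va, vb):
    return va >> (8 * ((vb >> 3) & 0xF))               # shift right by octets


def vsl(va, vb):
    return (va << (vb & 0x7)) & MASK128                # assumes equal 3 LSBs per byte


def vsr(va, vb):
    return va >> (vb & 0x7)                            # assumes equal 3 LSBs per byte


def vspltb15(vb):
    return int.from_bytes(bytes([vb & 0xFF]) * 16, "big")  # splat byte 15


def vsububm(va, vb):
    a, b = va.to_bytes(16, "big"), vb.to_bytes(16, "big")
    return int.from_bytes(bytes((x - y) & 0xFF for x, y in zip(a, b)), "big")


def shift_left_128(vx, vy):
    vz = vslo(vx, vy)
    vy = vspltb15(vy)
    return vsl(vz, vy)


def shift_left_256(vw, vx, vy):
    t1 = vslo(vw, vy)          # shift the most significant register left
    vy = vspltb15(vy)
    t1 = vsl(t1, vy)
    vy = vsububm(0, vy)        # adjust count for right shift
    t2 = vsro(vx, vy)          # right shift the least significant register
    t2 = vsr(t2, vy)
    return t1 | t2             # merge


vw, vx = 0x0123456789ABCDEF << 64, 0xFEDCBA9876543210 << 64
print(hex(shift_left_256(vw, vx, 12)))
```

In this model a count of 0 makes the seven-instruction sequence return vW OR vX, because the adjusted right-shift count wraps to 0 rather than 128, so the checks are limited to counts 1–127.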
4.2.6.1 AltiVec Status and Control Register Instructions

Table 4-28 summarizes the instructions for reading from or writing to the vector status and control register (VSCR). For more information on the VSCR, see Section 2.3.2, "Vector Status and Control Register (VSCR)."

Table 4-28. Move to/from Vector Status and Control Register Instructions

Name                                            Mnemonic   Syntax   Operation
Move to vector status and control register      mtvscr     vB       Place the contents of vB into VSCR.
Move from vector status and control register    mfvscr     vD       Place the contents of VSCR into vD.

4.2.7 Recommended Simplified Mnemonics

To simplify assembly language programs, a set of simplified mnemonics is provided for some of the most frequently used operations (such as no-op, load immediate, load address, move register, and complement register). Assemblers could provide the simplified mnemonics listed below. Programs written to be portable across the various assemblers for the PowerPC architecture should not assume the existence of mnemonics not described in this document.

Simplified mnemonics are provided for the data stream touch (dst) and data stream touch for store (dstst) instructions so that they can be coded with the transient indicator as part of the mnemonic rather than as a numeric operand. Similarly, simplified mnemonics are provided for the data stream stop (dss) instruction so that it can be coded with the all-streams indicator as part of the mnemonic. These are shown as examples with the instructions in Table 4-29.

Table 4-29. Simplified Mnemonics for Data Stream Touch (dst)

Operation                                      Simplified mnemonic    Equivalent to
Data stream touch (non-transient)              dst rA,rB,STRM         dst rA,rB,STRM,0
Data stream touch transient                    dstt rA,rB,STRM        dst rA,rB,STRM,1
Data stream touch for store (non-transient)    dstst rA,rB,STRM       dstst rA,rB,STRM,0
Data stream touch for store transient          dststt rA,rB,STRM      dstst rA,rB,STRM,1
Data stream stop (one stream)                  dss STRM               dss STRM,0
Data stream stop all                           dssall                 dss 0,1

4.3 AltiVec VEA Instructions

The PowerPC virtual environment architecture (VEA) describes the semantics of the memory model that can be assumed by software processes, and includes descriptions of the cache model, cache-control instructions, address aliasing, and other related issues.
Implementations that conform to the VEA also adhere to the UISA, but may not necessarily adhere to the OEA. For further details, see Chapter 4, "Addressing Mode and Instruction Set Summary," in the Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture.

This section describes the additional AltiVec instructions defined for the VEA.

4.3.1 Memory Control Instructions - VEA

Memory control instructions include the following types:

• Cache management instructions (user-level and supervisor-level)
• Segment register manipulation instructions
• Segment lookaside buffer management instructions
• Translation lookaside buffer (TLB) management instructions

This section describes the user-level cache management instructions defined by the VEA. See Chapter 4, "Addressing Mode and Instruction Set Summary," in the Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture for more information about supervisor-level cache, segment register manipulation, and TLB management instructions.

4.3.2 User-Level Cache Instructions - VEA

The instructions summarized in this section give user-level programs the ability to manage on-chip caches if they are implemented. See Chapter 5, "Cache Model and Memory Coherency," in the Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture for more information about cache topics.

Bandwidth between the processor and memory is managed explicitly by the programmer through the use of cache management instructions. These instructions give software a way to tell the cache hardware how it should prefetch data and prioritize writeback of data. The principal instruction for this purpose is a software-directed cache prefetch instruction called data stream touch (dst). Other related instructions are provided for complete control of the software-directed cache prefetch mechanism.

Table 4-30 summarizes the directed prefetch cache instructions defined by the VEA. Note that these instructions are accessible to user-level programs. See Section 5.2.1, "Software-Directed Prefetch," for further details on the prefetch cache instructions.
Table 4-30. User-Level Cache Instructions

Name: Data stream touch
Mnemonic and syntax: dst rA,rB,STRM,T
Operation: This instruction associates the data stream specified by the contents of rA and rB with the stream ID specified by STRM. The specified data stream is defined by the following:
    EA: (rA), where rA ≠ 0
    Unit size: (rB)[3–7] if (rB)[3–7] ≠ 0; otherwise 32
    Count: (rB)[8–15] if (rB)[8–15] ≠ 0; otherwise 256
    Stride: (rB)[16–31] if (rB)[16–31] ≠ 0; otherwise 32768
The T bit of the instruction indicates whether the data stream is likely to be loaded from fairly frequently in the near future (T = 0) or to be transient (T = 1). If rA = 0, the instruction form is invalid. See Section 5.2.1.1, "Data Stream Touch (dst)," for further details on the dst instruction.

Name: Data stream touch transient
Mnemonic and syntax: dstt rA,rB,STRM
Operation: This instruction associates the data stream specified by the contents of registers rA and rB with the stream ID specified by STRM. This instruction is a hint that performance will probably be improved if the cache blocks containing the specified data stream are fetched into the data cache, because the program will probably soon load from the stream, and that the data stream will be relatively transient in nature. That is, it will have poor locality and is likely to be referenced very few times or over a very short period of time. The memory subsystem can use this persistent/transient knowledge to manage the data as is most appropriate for the specific design of the cache/memory hierarchy of the processor on which the program is executing. An implementation is free to ignore dstt; in that case it should simply be executed as a dst. However, software should always attempt to use the correct form of dst or dstt regardless of whether the intended processor implements dstt. In this way the program will automatically benefit when run on processors that support dstt. The specified data stream is defined by the following:
    EA: (rA), where rA ≠ 0
    Unit size: (rB)[3–7] if (rB)[3–7] ≠ 0; otherwise 32
    Count: (rB)[8–15] if (rB)[8–15] ≠ 0; otherwise 256
    Stride: (rB)[16–31] if (rB)[16–31] ≠ 0; otherwise 32768
The T bit of the instruction indicates whether the data stream is likely to be loaded from fairly frequently in the near future (T = 0) or to be transient (T = 1); dstt is the T = 1 form. If rA = 0, the instruction form is invalid. See Section 5.2.1.2, "Transient Streams," for further details on the dstt instruction.

Name: Data stream touch for store (non-transient)
Mnemonic and syntax: dstst rA,rB,STRM,T
Operation: This instruction associates the data stream specified by the contents of registers rA and rB with the stream ID specified by STRM. This instruction is a hint that performance will probably be improved if the cache blocks containing the specified data stream are fetched into the data cache, because the program will probably soon store into the stream, and that prefetching from any data stream that was previously associated with the specified stream ID is no longer needed. The hint is ignored for blocks that are caching inhibited. The specified data stream is defined by the following:
    EA: (rA), where rA ≠ 0
    Unit size: (rB)[3–7] if (rB)[3–7] ≠ 0; otherwise 32
    Count: (rB)[8–15] if (rB)[8–15] ≠ 0; otherwise 256
    Stride: (rB)[16–31] if (rB)[16–31] ≠ 0; otherwise 32768
The T bit of the instruction indicates whether the data stream is likely to be stored into fairly frequently in the near future (T = 0) or to be transient (T = 1). If rA = 0, the instruction form is invalid. See Section 5.2.1.3, "Storing to Streams (dstst)," for further details on the dstst instruction.

Name: Data stream touch for store transient
Mnemonic and syntax: dststt rA,rB,STRM
Operation: This instruction associates the data stream specified by the contents of rA and rB with the stream ID specified by STRM. This instruction is a hint that performance will probably be improved if the cache blocks containing the specified data stream are fetched into the data cache, because the program will probably soon store into the stream, and that the data stream will be relatively transient in nature. That is, it will have poor locality and is likely to be referenced very few times or over a very short period of time. The memory subsystem can use this persistent/transient knowledge to manage the data as is most appropriate for the specific design of the cache/memory hierarchy of the processor on which the program is executing. The specified data stream is defined by the following:
    EA: (rA), where rA ≠ 0
    Unit size: (rB)[3–7] if (rB)[3–7] ≠ 0; otherwise 32
    Count: (rB)[8–15] if (rB)[8–15] ≠ 0; otherwise 256
    Stride: (rB)[16–31] if (rB)[16–31] ≠ 0; otherwise 32768
The T bit of the instruction indicates whether the data stream is likely to be stored into fairly frequently in the near future (T = 0) or to be transient (T = 1); dststt is the T = 1 form. If rA = 0, the instruction form is invalid. See Section 5.2.1.3, "Storing to Streams (dstst)," for further details on the dststt instruction.

Name: Data stream stop
Mnemonic and syntax: dss STRM,A
Operation: If A = 0 and a data stream associated with the stream ID specified by STRM exists, this instruction terminates prefetching of that data stream. If A = 1, this instruction terminates prefetching of all existing data streams. (The STRM field is ignored.) In addition, executing a dss instruction ensures that all memory accesses associated with data stream prefetching caused by preceding dst and dstst instructions that specified the same stream ID as that specified by the dss instruction (A = 0), or by all preceding dst and dstst instructions (A = 1), will be in group G1 with respect to the memory barrier created by a subsequent sync instruction. dss serves as both a basic and an extended mnemonic. The assembler recognizes a dss mnemonic with two operands as the basic form, and a dss mnemonic with one operand as the extended form. Execution of a dss instruction causes address translation for the specified data stream(s) to cease. Prefetch requests for which the effective address has already been translated may complete and may place the corresponding data into the data cache. See Section 5.2.1.4, "Stopping Streams," for further details on the dss instruction.

Name: Data stream stop all
Mnemonic and syntax: dssall
Operation: Terminates prefetching of all existing data streams. All active streams may be stopped. If the optional data stream prefetch facility is implemented, software that switches from an interrupted program to another program (for example, the operating system on a process switch) should execute dssall (extended mnemonic for dss) to terminate any data stream prefetching requested by the interrupted program. This avoids prefetching data in the wrong context, consuming memory bandwidth fetching data that are not likely to be needed by the other program, and interfering with data cache use by the other program. The dssall must be followed by a sync, and additional software synchronization may be required. See Section 5.2.1.4, "Stopping Streams," for further details on the dssall instruction.
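The relationship between the simplified mnemonics of Table 4-29 and the basic dst, dstst, and dss forms listed in Table 4-30 can be sketched as a small expansion routine, of the kind an assembler front end might apply. The function name and the operand representation (a list of strings) are assumptions made for illustration; this is not the behavior of any particular assembler.

```python
# Illustrative sketch only; expand_simplified is a hypothetical helper.
# Maps each three-operand touch mnemonic to (basic mnemonic, T operand).
_TOUCH_FORMS = {
    "dst": ("dst", "0"),       # data stream touch (non-transient)
    "dstt": ("dst", "1"),      # data stream touch transient
    "dstst": ("dstst", "0"),   # data stream touch for store (non-transient)
    "dststt": ("dstst", "1"),  # data stream touch for store transient
}


def expand_simplified(mnemonic, operands):
    """Return (basic_mnemonic, operands) for a data stream instruction."""
    ops = list(operands)
    if mnemonic in _TOUCH_FORMS and len(ops) == 3:
        base, t_bit = _TOUCH_FORMS[mnemonic]
        return base, ops + [t_bit]
    if mnemonic == "dss" and len(ops) == 1:
        return "dss", ops + ["0"]          # stop one stream: A = 0
    if mnemonic == "dssall" and not ops:
        return "dss", ["0", "1"]           # stop all streams: A = 1
    if mnemonic in ("dst", "dstst") and len(ops) == 4:
        return mnemonic, ops               # already the basic form
    if mnemonic == "dss" and len(ops) == 2:
        return mnemonic, ops               # already the basic form
    raise ValueError(f"unrecognized form: {mnemonic} {', '.join(ops)}")


print(expand_simplified("dstt", ["r3", "r4", "2"]))
print(expand_simplified("dssall", []))
```

The operand count decides between the basic and extended readings of dss, which mirrors the one-operand versus two-operand rule stated in the dss row.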
Chapter 5
Cache, Exceptions, and Memory Management

This chapter summarizes details of AltiVec technology that pertain to cache and memory management models. Note that AltiVec technology defines most of its instructions at the user level (UISA). Because most AltiVec instructions are computational, there is little effect on the VEA and OEA portions of the PowerPC architecture definition. Because the AltiVec instruction set architecture (ISA) uses 128-bit operands, additional instructions are provided to optimize cache and memory bus use.

5.1 PowerPC Shared Memory

To fully understand the data stream prefetch instructions for AltiVec, one needs a knowledge of the PowerPC architecture for shared memory. The PowerPC architecture supports the sharing of memory between programs, between different instances of the same program, and between processors and other mechanisms. It also supports access to memory by one or more programs using different effective addresses. All these cases are considered memory sharing. Memory is shared in blocks that are an integral number of pages. When the same memory has different effective addresses, the addresses are called aliases. Each application can be granted separate access privileges to aliased pages.

For more details on how the PowerPC architecture supports the sharing of memory, see Chapter 5, "Cache Model and Memory Coherency," in the Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture.

5.2 AltiVec Memory Bandwidth Management

The AltiVec ISA provides a way for software to speculatively load larger blocks of data from memory. That is, bandwidth that would otherwise be idle can be used to let software take advantage of locality and reduce the number of system memory accesses.
5.2.1 Software-Directed Prefetch

Bandwidth between the processor and memory is managed explicitly by the programmer using cache management instructions. These instructions let software indicate to the cache hardware how to prefetch and prioritize data writeback. The principal instruction for this purpose is a software-directed cache prefetch instruction, data stream touch (dst), described in the following section.

5.2.1.1 Data Stream Touch (dst)

The data stream prefetch facility permits a program to indicate that a sequence of units of memory is likely to be accessed soon by memory access instructions. Such a sequence is called a data stream or, when the context is clear, simply a stream. A data stream is defined by the following:

• EA: the effective address of the first unit in the sequence
• Unit size: the number of quad words in each unit; 0 < unit size ≤ 32
• Count: the number of units in the sequence; 0 < count ≤ 256
• Stride: the number of bytes between the effective address of one unit in the sequence and the effective address of the next unit in the sequence (that is, the effective address of the nth unit in the sequence is EA + (n − 1) × stride); −32768 ≤ stride < 0 or 0 < stride ≤ 32768

The units need not be aligned on a particular memory boundary. The stride may be negative.

The dst instruction specifies a starting address, a block size (1–32 vectors), a number of blocks to prefetch (1–256 blocks), and a signed stride in bytes (−32,768 to +32,768 bytes). The 2-bit tag, specified as an immediate field in the opcode, identifies one of four possible touch streams. The starting address of the stream is specified in rA (if rA = 0, the instruction form is invalid). BlockSize, BlockCount, and BlockStride are specified in rB. Do not confuse these blocks with cache blocks: the term "cache block" always indicates a PowerPC cache block. The format of the rB register is shown in Figure 5-1.

Figure 5-1. Format of rB in dst Instruction

    Bits 0–2    Bits 3–7     Bits 8–15     Bits 16–31
    0 0 0       BlockSize    BlockCount    Signed BlockStride

There is no zero-length block size, block count, or block stride. A BlockSize of 0 indicates 32 vectors, a BlockCount of 0 indicates 256 blocks, and a BlockStride of 0 indicates +32,768 bytes. Otherwise, these fields correspond to the numerical value of the size, count, and stride. Do not specify strides smaller than 1 block (16 bytes).
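The rB layout and the zero-means-maximum rule described above can be illustrated with a pair of helper functions that build and decode the control word. This is a minimal sketch, assuming PowerPC bit numbering (bit 0 is the most significant bit of the 32-bit register); the function names are hypothetical and are not part of any compiler or assembler.

```python
# Illustrative sketch only; both function names are hypothetical.
def pack_dst_control(block_size, block_count, block_stride):
    """Build the 32-bit rB value for dst from size (vectors), count, stride (bytes)."""
    if not 1 <= block_size <= 32:
        raise ValueError("block size must be 1-32 vectors")
    if not 1 <= block_count <= 256:
        raise ValueError("block count must be 1-256 blocks")
    if block_stride == 0 or not -32768 <= block_stride <= 32768:
        raise ValueError("stride must be -32768..-1 or 1..32768 bytes")
    size_field = block_size % 32        # 32 vectors is encoded as 0
    count_field = block_count % 256     # 256 blocks is encoded as 0
    # +32768 is encoded as 0; other strides are 16-bit two's complement
    stride_field = 0 if block_stride == 32768 else block_stride & 0xFFFF
    return (size_field << 24) | (count_field << 16) | stride_field


def unpack_dst_control(rb):
    """Decode an rB value into (block_size, block_count, block_stride)."""
    size = ((rb >> 24) & 0x1F) or 32
    count = ((rb >> 16) & 0xFF) or 256
    raw = rb & 0xFFFF
    if raw == 0:
        stride = 32768
    elif raw & 0x8000:
        stride = raw - 0x10000          # negative stride
    else:
        stride = raw
    return size, count, stride


print(hex(pack_dst_control(2, 6, 64)))
print(unpack_dst_control(0))
```

Note that the helper accepts any nonzero stride in range; the recommendation above not to use strides smaller than 16 bytes is left to the caller.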
The programmer specifies block size in terms of vectors (16 bytes), regardless of the cache-block size. Hardware automatically optimizes the number of cache blocks it fetches to bring a block into the cache. The number of cache blocks fetched into the cache for each block is the fewest natural cache blocks needed to fetch the entire block, including the effects of block misalignment to cache blocks, as shown in the following:

$$\text{CacheBlocksFetched} = \left\lceil \frac{\text{BlockSize} + \operatorname{mod}(\text{BlockAddr}, \text{CacheBlockSize})}{\text{CacheBlockSize}} \right\rceil$$

The address of each block in a stream is a function of the stream's starting address, the block stride, and the block being fetched. The starting address may be any 32-bit byte address. Each block's address is computed as a full 32-bit byte address from the following:

$$\text{BlockAddr}_n = (\text{rA}) + n \times (\text{rB})_{16\text{--}31}, \quad n \in \{0, \ldots, \text{BlockCount} - 1\}$$

where, if $(\text{rB})_{16\text{--}31} = 0$, then $(\text{rB})_{16\text{--}31} \Leftarrow 32768$.

The address of the first cache block fetched in each block is that block's address aligned to the next lower natural cache-block boundary by ignoring log2(CacheBlockSize) least significant bits (LSBs) (for example, for 32-byte cache blocks, the five LSBs are ignored). Cache blocks are then fetched sequentially forward until the entire block of vectors is brought into the cache. An example of a six-block data stream is shown in Figure 5-2.

Figure 5-2. Data Stream Touch
(Figure not reproduced: a six-block stream in memory, blocks numbered 0–5, with starting address = (rA), BlockSize = (rB)[3–7], BlockStride = (rB)[16–31], BlockCount = (rB)[8–15] = 6, and BlockAddr n marked for n = 3.)

Executing a dst instruction notifies the cache/memory subsystem that the program will soon need specified data. If bandwidth is available, the hardware starts loading the specified stream into the cache. To the extent that hardware can acquire the data, when the loads requiring the data finally execute, the target data will be in the cache. Executing a second dst to the tag of a stream in progress aborts the existing stream (at the hardware's earliest convenience) and establishes a new stream with the same stream tag ID.

The dst instruction is a hint to hardware and has no architecturally visible effects (in the PowerPC UISA sense). The hardware is free to ignore it, to start the prefetch when it can, to abort the stream at any time, or to prioritize other memory operations ahead of it. If a stream is aborted, the program still functions properly, but subsequent loads experience the full latency of a cache miss.
The dst instruction does not introduce implementation problems like those of load/store multiple/string instructions. Because dst does not affect the architectural state, it does not cause the interlock problems associated with load/store multiple/string instructions. Also, dst does not take exceptions and requires no complex recovery mechanism.

Touch instructions should be considered strong hints. Using them in highly speculative situations could waste considerable bandwidth. Implementations that do not implement the stream mechanism treat stream instructions (dst, dstt, dstst, dststt, dss, and dssall) as no-ops. If the stream mechanism is implemented, all four streams must be provided.

5.2.1.2 Transient Streams

The memory subsystem considers dst an indication that its stream data is likely to have some reasonable degree of locality and be referenced several times or over some reasonably long period. This is called persistence. The data stream touch transient instruction (dstt) indicates to the memory system that its stream data is transient, that is, it has poor locality and is likely to be used very few times or only for a very short time. A memory subsystem can use this knowledge to manage data for the processor's cache/memory design. An implementation may ignore the distinction between transience and persistence; in that case, dstt acts like dst. However, portable software should always use the correct form of dst or dstt regardless of whether the intended processor makes that distinction.

5.2.1.3 Storing to Streams (dstst)

A dst instruction brings a cache block into the cache subsystem in a state most efficient for subsequent reading of data from it (load). The companion instruction, data stream touch for store (dstst), brings the cache block into the cache subsystem in a state most efficient for subsequent writing to it (store).
For example, in a MESI cache subsystem, a dst might bring a cache block in shared (S) state, whereas a dstst would bring the cache block in exclusive (E) state to avoid a subsequent demand-driven bus transaction to take ownership of the cache block so the store can proceed.

The dstst streams are the same physical streams as dst streams, that is, dstst stream tags are aliases of dst tags. If not implemented, dstst defaults to dst. If dst is not implemented, it is a no-op. The dststt instruction is a transient version of dstst.

Data stream prefetching of memory locations is not supported when bit 57 of the segment table entry or bit 0 of the segment register (SR) is set. If a dst or dstst instruction specifies a data stream containing these memory locations, results are undefined.
5.2.1.4 Stopping Streams

The dst instructions have a counterpart called data stream stop (dss). A program can stop any given stream prefetch by executing dss with that stream's tag. This is useful when a program speculatively starts a stream prefetch but later determines that the instruction stream went the wrong way. The dss instruction can stop the stream so that no more bandwidth is wasted.

All active streams may be stopped by using dssall. This is useful when the operating system needs to stop all active streams (process switch), but does not know how many streams are in progress. Because dssall does not specify the number of implemented streams, it should always be used instead of a sequence of dss instructions to stop all streams.

Neither dss nor dssall is execution synchronizing; the time between when a dss is issued and the stream stops is not specified. Therefore, when software must ensure that the stream is physically stopped before continuing (for example, before changing virtual memory mapping), a special sequence of synchronizing instructions is required. The sequence can differ for different situations, but the following sequence works in all contexts:

    dssall            ; stop all streams
    sync              ; insert a barrier in memory pipe
    lwz   rN,...      ; stick one more operation in memory pipe
    cmpd  rN,rN       ;
    bne-  *-4         ; make sure load data is back
    isync             ; wait for all previous instructions to complete to ensure
                      ; memory pipe is clear and nothing is pending in the old context

Data stream prefetching for a given stream is terminated by executing the appropriate dss instruction. The termination can be synchronized by executing a sync instruction after the dss instruction if the memory barrier created by sync orders all address translation effects of the subsequent context-altering instructions. Otherwise, data dependencies are also required.
For example, the following instruction sequence terminates all data stream prefetching before altering the contents of a segment register (SR):

    dssall                ; stop all data stream prefetching
    sync                  ; order dssall before load
    lwz   rY,sr_y(rX)     ; load new SR value
    mtsr  y,rY            ; alter SR y

The mtsr instruction cannot be executed until the lwz loads the SR value into rY. The memory access caused by the lwz cannot be performed until the dssall instruction takes effect (that is, until address translation stops for all data streams and all memory accesses associated with data stream prefetches for which the effective address was translated before the translation stops are performed).
5.2.1.5 Exception Behavior of Prefetch Streams

In general, exceptions do not cancel streams. Streams are sensitive to whether the processor is in user or supervisor mode (determined by MSR[PR]) and whether data address translation is used (determined by MSR[DR]). This allows prefetch streams to behave predictably when an exception occurs.

Streams are suspended in real addressing mode (MSR[DR] = 0) and remain suspended until translation is turned back on (MSR[DR] is set). A dst instruction issued while MSR[DR] = 0 produces boundedly undefined results.

A stream is suspended whenever MSR[PR] is different from what it was when the dst that established it was issued. For example, if a dst is issued in user mode (MSR[PR] = 1), the resulting stream is suspended when the processor enters supervisor mode (MSR[PR] = 0) and remains suspended until the processor returns to user mode. Conversely, if the dst was issued in supervisor mode, the stream is suspended if the machine enters user mode.

Because exceptions do not cancel streams automatically, the operating system must stop streams explicitly when warranted, for example, when switching processes or changing virtual memory context.

Care must be taken if data stream prefetching is used in supervisor-level state (MSR[PR] = 0). After an exception is taken, the supervisor-level program that next changes MSR[DR] from 0 to 1 causes data stream prefetching to resume for any data streams for which the corresponding dst or dstst instruction was executed in supervisor mode; such streams are called supervisor-level data streams. This program is unlikely to be the one that executed the corresponding dst or dstst instruction and is unlikely to use the same address translation context as that in which the dst or dstst was executed.
Suspension and resumption of data stream prefetching work more naturally for user-level data streams, because the next application program to be dispatched after an exception occurs is likely to be the most recently interrupted program.

An exception handler that changes the context in which data addresses are translated may need to terminate data stream prefetching for supervisor-level data streams and to synchronize the termination before changing MSR[DR] to 1. Although terminating all data stream prefetching in this case would satisfy the requirements of the architecture, doing so would adversely affect the performance of applications that use data stream prefetching. Thus, it may be better for the operating system to record stream IDs associated with any supervisor-level data streams and to terminate prefetching for those streams only.

Cache effects of supervisor-level data stream prefetching can also adversely affect performance of applications that use data stream prefetching, as supervisor-level use of the associated stream ID can take over an application's data stream.

Data stream instructions cannot cause exceptions directly. Therefore, when a stream access encounters an event that would cause an exception on a normal load or store, such as a page fault or protection violation, the access is instead aborted and the event is ignored.
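The suspension rules of this section, together with the suggestion that the operating system record the stream IDs of supervisor-level streams, can be sketched as a small bookkeeping model. Everything here is hypothetical illustration: the Stream record, the function names, and the idea of tracking streams in a Python list are assumptions, not a description of any operating system.

```python
# Illustrative model only; all names are hypothetical.
from dataclasses import dataclass


@dataclass
class Stream:
    stream_id: int   # 0-3, the tag given to dst/dstst
    issued_pr: int   # MSR[PR] when the establishing dst/dstst was executed


def is_suspended(stream, msr_pr, msr_dr):
    """A stream is suspended in real addressing mode (MSR[DR] = 0) or when
    MSR[PR] differs from its value when the stream was established."""
    return msr_dr == 0 or msr_pr != stream.issued_pr


def supervisor_stream_ids(streams):
    """Stream IDs an exception handler might stop individually with dss,
    rather than stopping every stream with dssall."""
    return sorted(s.stream_id for s in streams if s.issued_pr == 0)


streams = [Stream(0, issued_pr=1), Stream(2, issued_pr=0)]
for s in streams:
    # Processor state just after an exception: supervisor mode, translation off
    print(s.stream_id, is_suspended(s, msr_pr=0, msr_dr=0))
print(supervisor_stream_ids(streams))
```

In this model only stream 2 would resume when supervisor code turns translation back on, which is why it is the one the handler would need to terminate before changing MSR[DR] to 1.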
Suspension or termination of data stream prefetching for a given data stream need not cancel prefetch requests for that data stream for which the effective address has been translated and need not cause data returned by such requests to be discarded. However, to improve software's ability to pace data stream prefetching with data consumption, it may be better to limit the number of these pending requests that can exist simultaneously.

5.2.1.6 Synchronization Behavior of Streams

Streams are not affected (stopped or suspended) by execution of any PowerPC synchronization instructions (sync, isync, or eieio). This permits these instructions to be used for synchronizing multiple processors without disturbing background prefetch streams. Prefetch streams have no architecturally observable effects and are not affected by synchronization instructions. Synchronizing the termination of data stream prefetching is needed only by the operating system.

5.2.1.7 Address Translation for Streams

Like dcbt and dcbtst instructions, dst, dstst, dstt, and dststt are treated as loads with respect to address translation, memory protection, and reference and change recording.

Unlike dcbt and dcbtst instructions, stream instructions that cause a TLB miss cause a page table search and the page descriptor to be loaded into the TLB. Conceptually, address translation and protection checking are performed on every cache-block access in the stream and proceed normally across page boundaries and TLB misses, terminating only on page faults or protection violations that would cause a DSI exception.

Stream instructions operate like normal PowerPC cache instructions (such as dcbt) with respect to guarded memory; they are not subject to normal restrictions against prefetching in guarded space because they are program-directed.
however, speculative dst instructions can not start a prefetch stream to guarded space. if the effective address of a cache block within a data stream cannot be translated, or if loading from the block would violate memory protection, the processor will terminate prefetching of that stream. (continuing to prefetch subsequent cache blocks within the stream might cause prefetching to get too far ahead of consumption of prefetched data.) if the effective address can be translated, a tlb miss can cause such termination, even on implementations for which tlbs are reloaded in software. 5.2.1.8 stream usage notes a given data stream exists if a dst or dstst instruction has been executed that speci?s the stream and prefetching of the stream has neither completed, terminated, or been supplanted. prefetching of the stream has completed, when all the memory locations within the stream that will ever be prefetched as a result of executing the dst or dstst instruction have been prefetched (for example, locations for which the effective address cannot be translated will f r e e s c a l e s e m i c o n d u c t o r , i freescale semiconductor, inc. f o r m o r e i n f o r m a t i o n o n t h i s p r o d u c t , g o t o : w w w . f r e e s c a l e . c o m n c . . .
never be prefetched). Prefetching of the stream is terminated by executing the appropriate dss instruction; it is supplanted by executing another dst or dstst instruction that specifies the stream ID associated with the given stream. Because there are four stream IDs, as many as four data streams may exist simultaneously.

The maximum block count of dst is small because of its preferred usage. It is not intended for a single dst instruction to prefetch an entire data stream. Instead, dst instructions should be issued periodically, for example on each loop iteration, for the following reasons:

• Short, frequent dst instructions better synchronize the stream with the consumption of data.
• With prefetch closely synchronized just ahead of consumption, another activity is less likely to inadvertently evict prefetched data from the cache before it is needed.
• The prefetch stream is restarted automatically after an exception (that could have caused the stream to be terminated by the operating system) with no additional complex hardware mechanisms needed to restart the prefetch stream.

Issuing new dst instructions to stream tag IDs in progress terminates old streams; dst instructions cannot be queued. For example, when multiple dst instructions are used to prefetch a large stream, it would be poor strategy to issue a second dst whose stream begins at the specified end of the first stream before it was certain that the first stream had completed. This could terminate the first stream prematurely, leaving much of the stream unprefetched. Paradoxically, it would also be unwise to wait for the first stream to complete before issuing the second dst. Detecting completion of the first stream is not possible, so the program would have to introduce a pessimistic waiting period before restarting the stream and then incur the full start-up latency of the second stream.
The correct strategy is to issue the second dst well before the anticipated completion of the first stream and begin it at an address overlapping the first stream by an amount sufficient to cover any portion of the first stream that could not yet have been prefetched. Issuing the second dst too early is not a concern because blocks prefetched by the first stream hit in the cache and need not be refetched. Thus, even if issued prematurely and overlapped excessively, the second dst rapidly advances to the point of prefetching new blocks. This strategy allows a smooth transition from the first stream to the second without significant breaks in the prefetch stream.

For the greatest performance benefit from data-stream prefetching, use the dst and dstst (and dss) instructions so that the prefetched data is used soon after it is available in the data cache. Pacing data stream prefetching with consumption increases the likelihood that prefetched data is not displaced from the cache before it is used, and reduces the likelihood that prefetched data displaces other data needed by the program.
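The overlap strategy described above can be sketched as a small planning routine. This is an illustrative model only, assuming a contiguous buffer that is consumed in fixed-size blocks; the function name plan_components and its parameters are hypothetical and are not part of the AltiVec architecture. Each tuple is the (starting block, block count) that one dst instruction, reusing the same stream ID, might describe.

```python
def plan_components(total_blocks, blocks_per_dst, overlap_blocks):
    """Split one logical stream into overlapping components.

    Hypothetical helper: returns (start_block, block_count) pairs, one per
    dst instruction. Each component starts overlap_blocks before the end of
    the previous one, so blocks the earlier stream had not yet prefetched
    when it was supplanted are still covered by its successor.
    """
    if blocks_per_dst <= 0 or not 0 <= overlap_blocks < blocks_per_dst:
        raise ValueError("overlap must be smaller than the component length")
    step = blocks_per_dst - overlap_blocks  # new blocks covered per dst
    components = []
    start = 0
    while start < total_blocks:
        count = min(blocks_per_dst, total_blocks - start)
        components.append((start, count))
        if start + count >= total_blocks:
            break  # the last component reaches the end of the stream
        start += step
    return components


# Example: 20 blocks, 8 blocks per dst, 2 blocks of overlap.
plan = plan_components(total_blocks=20, blocks_per_dst=8, overlap_blocks=2)
```

A larger overlap costs little, because blocks already prefetched by the earlier component hit in the cache, but it reduces the number of new blocks each dst contributes.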
Specifying each logical data stream as a sequence of shorter data streams helps achieve the desired pacing, even in the presence of exceptions and address translation failures. The components of a given logical data stream should have the following attributes:

• The same stream ID should be associated with each component.
• The components should partially overlap (that is, the first part of a component should consist of the same memory locations as the last part of the preceding component).
• The memory locations that do not overlap with the next component should be large enough that a substantial portion of the component is prefetched. That is, prefetch enough memory locations for the current component before it is taken over by the prefetching being done for the next component.

5.2.1.9 Stream Implementation Assumptions

Some processors can treat dst instructions as no-ops. However, if a processor implements dst, a minimum level of functionality is provided to create as consistent a programming model across different machines as possible. A program can assume the following functionality in a dst instruction:

• Implements all four tagged streams.
• Implements each tagged stream as a separate, independent stream with arbitration for memory access performed on a round-robin basis.
• Searches the table for each stream access that misses in the TLB.
• Does not abort streams on page boundary crossings.
• Does not abort streams on exceptions (except DSI exceptions caused by the stream).
• Does not abort streams, or delay execution pending completion of streams, on PowerPC synchronization instructions sync, isync, or eieio.
• Does not abort streams on TLB misses that occur on loads or stores issued concurrently with running streams. However, a DSI exception from one of those loads or stores may cause streams to abort.
5.2.2 Prioritizing Cache Block Replacement

Load Vector Indexed LRU (lvxl) and Store Vector Indexed LRU (stvxl) instructions provide explicit control over cache block replacement by letting the programmer indicate whether an access is likely to be the last reference made to the cache block containing this load or store. The cache hardware can then prioritize replacement of this cache block over others with older but more useful data.

Data accessed by a normal load or store is likely to be needed more than once. Marking this data as most-recently used (MRU) indicates that it should be a low-priority candidate for
replacement. However, some data, such as that used in DSP multimedia algorithms, is rarely reused and should be marked as the highest priority candidate for replacement. Normal accesses mark data MRU. Data unlikely to be reused can be marked LRU. For example, on replacing a cache block marked LRU by one of these instructions, a processor may improve cache performance by evicting the cache block without storing it in intermediate levels of the cache hierarchy (except to maintain cache consistency).

5.2.3 Partially Executed AltiVec Instructions

The OEA permits certain instructions to be partially executed when an alignment or DSI exception occurs. In the same way that the target register may be altered when floating-point load instructions cause a DSI exception, if the AltiVec facility is implemented, the target register (vD) may be altered when lvx or lvxl is executed and the TLB entry is invalidated before the access completes.

Exceptions cause data stream prefetching to be suspended for all existing data streams. Prefetching for a given data stream resumes when control is returned to the interrupted program, if the stream still exists (for example, the operating system did not terminate prefetching for the stream).

5.3 DSI Exception: Data Address Breakpoint

A data address breakpoint register (DABR) match causes a DSI exception in implementations that support the data breakpoint feature. When a DABR match occurs on a non-AltiVec processor that supports the PowerPC architecture, the DAR is set to any effective address between and including the word (for a byte, half word, or word access) specified by the effective address computed by the instruction and the effective address of the last byte in the word or double word in which the match occurred.
In processors that support the AltiVec technology, this would include a quad-word access from an lvx, lvxl, stvx, or stvxl instruction to a segment or BAT area.

5.4 AltiVec Unavailable Exception (0x00F20)

The AltiVec facility includes an additional instruction-caused, precise exception to those defined by the OEA and discussed in Chapter 6, "Exceptions," in the Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture. An AltiVec unavailable exception occurs when no higher priority exception exists (see Table 5-2), an attempt is made to execute an AltiVec instruction, and MSR[VEC] = 0.
Register settings for AltiVec unavailable exceptions are described in Table 5-1 and shown in Figure 5-3.

When an AltiVec unavailable exception is taken, instruction execution resumes at offset 0x00F20 from the base address determined by MSR[IP].

The dst and dstst instructions are supported if MSR[DR] = 1. If either instruction is executed when MSR[DR] = 0 (real addressing mode), results are boundedly undefined.

Conditions that cause this exception are prioritized among instruction-caused (synchronous), precise exceptions as shown in Table 5-2, taken from the section "Exception Priorities," in Chapter 6, "Exceptions," in the Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture.

Table 5-1. AltiVec Unavailable Exception: Register Settings

Register  Setting Description
SRR0      Set to the effective address of the instruction that caused the exception
SRR1      32-bit:
            0      Loaded with equivalent bits from the MSR
            1-4    Cleared
            5-9    Loaded with equivalent bits from the MSR
            10-15  Cleared
            16-31  Loaded with equivalent bits from the MSR
          Note that depending on the implementation, additional MSR bits may be copied to SRR1.
MSR       SF   1          EE   0          SE   0          IR   0
          ISF  unchanged  PR   0          BE   0          DR   0
          VEC  0          FP   0          FE1  0          RI   0
          POW  0          ME   unchanged  IP   unchanged  LE   Set to value of ILE
          ILE  unchanged  FE0  0

Figure 5-3. SRR1 Bit Settings After an AltiVec Unavailable Exception
(Bit 0: MSR[0]; bits 1-4: 0000; bits 5-9: MSR[5-9]; bits 10-15: 00_0000; bits 16-31: MSR[16-31].)
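The register settings in Table 5-1 can be modeled in a few lines. The following is a minimal sketch, not a description of any particular processor: it assumes big-endian bit numbering in a 32-bit MSR, assumes that IP is MSR bit 25, and assumes the classic 32-bit PowerPC convention that MSR[IP] selects an exception base of 0x00000000 or 0xFFF00000. None of those three assumptions is stated in this section, and all helper names are hypothetical.

```python
ALTIVEC_UNAVAILABLE_OFFSET = 0x00F20  # vector offset given in Section 5.4


def be_bit(n, width=32):
    """Mask for bit n in big-endian numbering (bit 0 is the most significant)."""
    return 1 << (width - 1 - n)


def be_field(first, last, width=32):
    """Mask covering big-endian bits first..last, inclusive."""
    mask = 0
    for n in range(first, last + 1):
        mask |= be_bit(n, width)
    return mask


MSR_IP = be_bit(25)  # assumption: IP is MSR bit 25 on 32-bit implementations

# Table 5-1: SRR1 bits 0, 5-9, and 16-31 are loaded from the MSR; the
# remaining bits (1-4 and 10-15) are cleared.
SRR1_COPY_MASK = be_bit(0) | be_field(5, 9) | be_field(16, 31)


def vector_address(msr):
    """Address at which execution resumes after the exception is taken."""
    base = 0xFFF00000 if msr & MSR_IP else 0x00000000  # assumed base values
    return base + ALTIVEC_UNAVAILABLE_OFFSET


def srr1_after_exception(msr):
    """SRR1 contents per Table 5-1, ignoring implementation-specific extras."""
    return msr & SRR1_COPY_MASK
```

An implementation may copy additional MSR bits to SRR1, as the note in Table 5-1 says, so the mask above is a lower bound on what a handler might observe.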
Table 5-2. Exception Priorities (Synchronous/Precise Exceptions)

Priority: 3 (see note 1)
Note 1: The exceptions are third in priority after system reset and machine check exceptions.

Exception: Instruction dependent. When an instruction causes an exception, the exception mechanism waits for any instructions prior to the excepting instruction in the instruction stream to complete. Any exceptions caused by these instructions are handled first. It then generates the appropriate exception if no higher priority exception exists when the exception is to be generated.

Note that a single instruction can cause multiple exceptions. When this occurs, those exceptions are ordered in priority as indicated in the following:

A. Integer loads and stores
   a. Alignment
   b. DSI
   c. Trace (if implemented)
B. Floating-point loads and stores
   a. Floating-point unavailable
   b. Alignment
   c. DSI
   d. Trace (if implemented)
C. Other floating-point instructions
   a. Floating-point unavailable
   b. Program: Precise-mode floating-point enabled exception
   c. Floating-point assist (if implemented)
   d. Trace (if implemented)
D. AltiVec loads and stores (if AltiVec facility implemented)
   a. AltiVec unavailable
   b. DSI
   c. Trace (if implemented)
E. Other AltiVec instructions (if AltiVec facility implemented)
   a. AltiVec unavailable
   b. Trace (if implemented)
F. The rfi and mtmsr
   a. Program: Supervisor level instruction
   b. Program: Precise-mode floating-point enabled exception
   c. Trace (if implemented), for mtmsr only
   If precise-mode IEEE floating-point enabled exceptions are enabled and FPSCR[FEX] is set, a program exception occurs no later than the next synchronizing event.
G. Other instructions
   a. These exceptions are mutually exclusive and have the same priority:
      • Program: Trap
      • System call (sc)
      • Program: Supervisor level instruction
      • Program: Illegal instruction
   b. Trace (if implemented)
F.
ISI exception: The ISI exception has the lowest priority in this category. It is only recognized when all instructions prior to the instruction causing this exception appear to have completed and that instruction is to be executed. The priority of this exception is specified for completeness and to ensure that it is not given more favorable treatment. An implementation can treat this exception as though it had a lower priority.
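The ordering in Table 5-2 amounts to a per-class priority list. The sketch below models only the load/store and AltiVec rows, as an illustration of how a simulator might pick the exception to report when one instruction raises several; the dictionary keys and the function name are hypothetical labels, not architectural names.

```python
# Priority order per instruction class, highest priority first, transcribed
# from rows A, B, D, and E of Table 5-2.
PRIORITY = {
    "integer load/store": ["alignment", "dsi", "trace"],
    "floating-point load/store": [
        "floating-point unavailable", "alignment", "dsi", "trace",
    ],
    "altivec load/store": ["altivec unavailable", "dsi", "trace"],
    "other altivec": ["altivec unavailable", "trace"],
}


def exception_taken(instruction_class, pending):
    """Return the highest-priority exception among those pending, or None."""
    for name in PRIORITY[instruction_class]:
        if name in pending:
            return name
    return None


# An lvx executed with MSR[VEC] = 0 to an address that would also fault
# reports the AltiVec unavailable exception, not the DSI.
first = exception_taken("altivec load/store", {"dsi", "altivec unavailable"})
```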
Chapter 6
AltiVec Instructions

This chapter lists the AltiVec instruction set in alphabetical order by mnemonic. Each entry includes the instruction format and a graphical representation of the instruction. All instructions are 32 bits long; a description of the instruction fields and pseudocode conventions is also provided. For more information on the AltiVec instruction set, refer to Chapter 4, "Addressing Modes and Instruction Set Summary." For more information on the PowerPC instruction set, refer to Chapter 8, "Instruction Set," in the Programming Environments Manual for 32-Bit Implementations of the PowerPC Architecture.

6.1 Instruction Formats

AltiVec instructions are four bytes (32 bits) long and are word-aligned. AltiVec instructions have up to four operands: three source vectors and one result vector. Bits 0-5 always specify the primary opcode for AltiVec instructions. AltiVec ALU-type instructions specify primary opcode 4 (0b00_01_00). AltiVec load, store, and stream prefetch instructions use secondary opcodes in primary opcode 31 (0b01_11_11).

Within a vector register, byte, half-word, and word elements are referred to as follows:

• Byte elements, each byte = 8 bits; in the pseudocode, n = 8 with a total of 16 elements
• Half-word elements, each half word = 16 bits; in the pseudocode, n = 16 with a total of 8 elements
• Word elements, each word = 32 bits; in the pseudocode, n = 32 with a total of 4 elements

Refer to Figure 1-3 for an example of how elements are placed in a vector register.

6.1.1 Instruction Fields

Table 6-1 describes the instruction fields used in the various instruction formats.
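The element counts and opcode placement in Section 6.1 follow from simple arithmetic on a 128-bit register and a 32-bit instruction word. The helpers below are a hypothetical illustration, assuming big-endian bit numbering so that bits 0-5 are the six most significant bits of the word.

```python
VECTOR_BITS = 128  # width of an AltiVec vector register


def element_count(n):
    """Number of n-bit elements in a vector register (n = 8, 16, or 32)."""
    if n not in (8, 16, 32):
        raise ValueError("element width must be 8, 16, or 32 bits")
    return VECTOR_BITS // n


def primary_opcode(word):
    """Extract bits 0-5 (the six most significant bits) of an instruction."""
    return (word >> 26) & 0x3F


def is_altivec_alu(word):
    """True if the word uses primary opcode 4, the AltiVec ALU-type opcode."""
    return primary_opcode(word) == 4


def uses_opcode_31(word):
    """True if the word uses primary opcode 31 (loads, stores, streams)."""
    return primary_opcode(word) == 31
```

Note that primary opcode 31 is shared with many non-AltiVec PowerPC instructions, so uses_opcode_31 alone does not identify an AltiVec load, store, or stream instruction; the secondary opcode must also be checked.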
Table 6-1. Instruction Syntax Conventions

Field          Description
OPCD (0-5)     Primary opcode field
rA, A (11-15)  Specifies a GPR to be used as a source or destination
rB, B (16-20)  Specifies a GPR to be used as a source
Rc (31)        Record bit
                 0  Does not update the condition register (CR).
                 1  For the optional AltiVec facility, set CR field 6 to control program flow as described in Section 2.4.1, "PowerPC Condition Register"
vA (11-15)     Specifies a vector register to be used as a source
vB (16-20)     Specifies a vector register to be used as a source
vC (21-25)     Specifies a vector register to be used as a source
vD (6-10)      Specifies a vector register to be used as a destination
vS (6-10)      Specifies a vector register to be used as a source
SHB (22-25)    Specifies a shift amount in bytes.
SIMM (11-15)   This immediate field is used to specify a (5-bit) signed integer.
UIMM (11-15)   This immediate field is used to specify a 4-, 8-, 12-, or 16-bit unsigned integer.

6.1.2 Notation and Conventions

The operation of some instructions is described by a semiformal language (pseudocode). See Table 6-2 for a list of additional pseudocode notation and conventions used throughout this section.

Table 6-2. Notation and Conventions

Notation/Convention    Meaning
←                      Assignment
¬                      NOT logical operator
do i=x to y by z       Do the following starting at x and iterating to y by z
+int                   2's complement integer add
-int                   2's complement integer subtract
+ui                    Unsigned integer add
-ui                    Unsigned integer subtract
*ui                    Unsigned integer multiply
+si                    Signed integer add
-si                    Signed integer subtract
*si                    Signed integer multiply
*sui                   Signed integer (first operand) multiplied by unsigned integer (second operand) producing signed result
/                      Integer divide
+fp                    Single-precision floating-point add
-fp                    Single-precision floating-point subtract
*fp                    Single-precision floating-point multiply
÷fp                    Single-precision floating-point divide
√fp                    Single-precision floating-point square root
<ui, ≤ui, >ui, ≥ui     Unsigned integer comparison relations
<si, ≤si, >si, ≥si     Signed integer comparison relations
<fp, ≤fp, >fp, ≥fp     Single-precision floating-point comparison relations
≠                      Not equal
=int                   Integer equal to
=ui                    Unsigned integer equal to
=si                    Signed integer equal to
=fp                    Floating-point equal to
x >>ui y               Shift x right by y bits, filling x's vacated bits with zeros
x >>si y               Shift x right by y bits, filling x's vacated bits with the sign bit of x
x <<ui y               Shift x left by y bits, filling x's vacated bits with zeros
||                     Used to describe the concatenation of two values (that is, 010 || 111 is the same as 010111)
&                      AND logical operator
|                      OR logical operator
⊕, ≡                   Exclusive-OR, equivalence logical operators (for example, (a ≡ b) = (a ⊕ ¬b))
0bnnnn                 A number expressed in binary format.
0xnnnn                 A number expressed in hexadecimal format.
?                      Unordered comparison relation
^x 0                   x zeros
^x 1                   x ones
^x y                   x copies of y
x_y                    Bit y of x
x_y:z                  Bits y through z, inclusive, of x
LENGTH(x)              Length of x, in bits. If x is the word "element", LENGTH(x) is the length, in bits, of the element implied by the instruction mnemonic.

(In this plain-text rendering, ^ marks a leading superscript and _ marks a subscript.)
rotl(x,y)              Result of rotating x left by y bits
uitouimod(x,y)         Chop unsigned integer x to y-bit unsigned integer
uitouisat(x,y)         Result of converting the unsigned-integer x to a y-bit unsigned-integer with unsigned-integer saturation
sitouisat(x,y)         Result of converting the signed-integer x to a y-bit unsigned-integer with unsigned-integer saturation
sitosimod(x,y)         Chop integer x to y-bit integer
sitosisat(x,y)         Result of converting the signed-integer x to a y-bit signed-integer with signed-integer saturation
rndtonearfp32          The single-precision floating-point number that is nearest in value to the infinitely-precise floating-point intermediate result x (in case of a tie, the even single-precision floating-point value is used).
rndtofpint32near       The value x if x is a single-precision floating-point integer; otherwise the single-precision floating-point integer that is nearest in value to x (in case of a tie, the even single-precision floating-point integer is used).
rndtofpint32trunc      The value x if x is a single-precision floating-point integer; otherwise the largest single-precision floating-point integer that is less than x if x>0, or the smallest single-precision floating-point integer that is greater than x if x<0
rndtofpint32ceil       The value x if x is a single-precision floating-point integer; otherwise the smallest single-precision floating-point integer that is greater than x
rndtofpint32floor      The value x if x is a single-precision floating-point integer; otherwise the largest single-precision floating-point integer that is less than x
cnvtfp32toui32sat(x)   Result of converting the single-precision floating-point value x to a 32-bit unsigned-integer with unsigned-integer saturation
cnvtfp32tosi32sat(x)   Result of converting the single-precision floating-point value x to a 32-bit signed-integer with signed-integer saturation
cnvtui32tofp32(x)      Result of converting the 32-bit unsigned-integer x to floating-point single format
cnvtsi32tofp32(x)      Result of converting the 32-bit signed-integer x to floating-point single format
mem(x,y)               Value at memory location x of size y bytes
swapdouble             Swap the doublewords in a quadword vector
zeroextend(x,y)        Zero-extend x on the left with zeros to produce y-bit value
signextend(x,y)        Sign-extend x on the left with sign bits (that is, with copies of bit 0 of x) to produce y-bit value
rotateleft(x,y)        Rotate x left by y bits
mod(x,y)               Remainder of x/y
uimaximum(x,y)         Maximum of 2 unsigned integer values, x and y
simaximum(x,y)         Maximum of 2 signed integer values, x and y
fpmaximum(x,y)         Maximum of 2 floating-point values, x and y
uiminimum(x,y)         Minimum of 2 unsigned integer values, x and y
siminimum(x,y)         Minimum of 2 signed integer values, x and y
fpminimum(x,y)         Minimum of 2 floating-point values, x and y
fpreciprocalestimate12(x)      12-bit-accurate floating-point estimate of 1/x
fpreciprocalsqrtestimate12(x)  12-bit-accurate floating-point estimate of 1/(sqrt(x))
fplog2estimate3(x)     3-bit-accurate floating-point estimate of log2(x)
fppower2estimate3(x)   3-bit-accurate floating-point estimate of 2**x
carryout(x + y)        Carry out of the sum of x and y
rotl[64](x, y)         Result of rotating the 64-bit value x left y positions
rotl[32](x, y)         Result of rotating the 64-bit value x || x left y positions, where x is 32 bits long
0bnnnn                 A number expressed in binary format.
0xnnnn                 A number expressed in hexadecimal format.
(n)x                   The replication of x, n times (that is, x concatenated to itself n - 1 times). (n)0 and (n)1 are special cases:
                         • (n)0 means a field of n bits with each bit equal to 0. Thus (5)0 is equivalent to 0b00000.
                         • (n)1 means a field of n bits with each bit equal to 1. Thus (5)1 is equivalent to 0b11111.
(rA|0)                 The contents of rA if the rA field has the value 1-31, or the value 0 if the rA field is 0.
(rX)                   The contents of rX
x[n]                   n is a bit or field within x, where x is a register
x^n                    x is raised to the nth power
abs(x)                 Absolute value of x
ceil(x)                Least integer ≥ x
characterization       Reference to the setting of status bits in a standard way that is explained in the text.
cia                    Current instruction address. The 32-bit address of the instruction being described by a sequence of pseudocode. Used by relative branches to set the next instruction address (NIA) and by branch instructions with LK = 1 to set the link register. Does not correspond to any architected register.
clear                  Clear the leftmost or rightmost n bits of a register to 0. This operation is used for rotate and shift instructions.
clear left and shift left  Clear the leftmost b bits of a register, then shift the register left by n bits. This operation can be used to scale a known non-negative array index by the width of an element. These operations are used for rotate and shift instructions.
cleared                Bits are set to 0.
do                     Do loop.
                         • Indenting shows range.
                         • "To" and/or "by" clauses specify incrementing an iteration variable.
                         • "While" clauses give termination conditions.
double(x)              Result of converting x from floating-point single-precision format to floating-point double-precision format
extract                Select a field of n bits starting at bit position b in the source register, right or left justify this field in the target register, and clear all other bits of the target register to zero. This operation is used for rotate and shift instructions.
exts(x)                Result of extending x on the left with sign bits
gpr(x)                 General-purpose register x
if...then...else...    Conditional execution, indenting shows range, else is optional
insert                 Select a field of n bits in the source register, insert this field starting at bit position b of the target register, and leave other bits of the target register unchanged. (No simplified mnemonic is provided for insertion of a field when operating on double words; such an insertion requires more than one instruction.) This operation is used for rotate and shift instructions. (Note that simplified mnemonics are referred to as extended mnemonics in the architecture specification.)
leave                  Leave innermost do loop, or the do loop described in leave statement.
mask(x, y)             Mask having ones in positions x through y (wrapping if x > y) and zeros elsewhere.
mem(x, y)              Contents of y bytes of memory starting at address x.
nia                    Next instruction address, which is the 32-bit address of the next instruction to be executed (the branch destination) after a successful branch. In pseudocode, a successful branch is indicated by assigning a value to NIA. For instructions which do not branch, the next instruction address is CIA + 4. Does not correspond to any architected register.
OEA                    PowerPC operating environment architecture
rotate                 Rotate the contents of a register right or left n bits without masking.
                       This operation is used for rotate and shift instructions.
rotl[64](x, y)         Result of rotating the 64-bit value x left y positions
rotl[32](x, y)         Result of rotating the 64-bit value x || x left y positions, where x is 32 bits long
set                    Bits are set to 1.
shift                  Shift the contents of a register right or left n bits, clearing vacated bits (logical shift). This operation is used for rotate and shift instructions.
single(x)              Result of converting x from floating-point double-precision format to floating-point single-precision format.
spr(x)                 Special-purpose register x
trap                   Invoke the system trap handler.
undefined              An undefined value. The value may vary from one implementation to another, and from one execution to another on the same implementation.
UISA                   PowerPC user instruction set architecture
VEA                    PowerPC virtual environment architecture

Table 6-3 describes instruction field notation conventions used throughout this chapter.

Table 6-3. Instruction Field Conventions

The PowerPC Architecture Specification    Equivalent in AltiVec Technology PEM as:
RA, RB, RT, RS                            rA, rB, rD, rS
SI                                        SIMM
U                                         IMM
UI                                        UIMM
VA, VB, VC, VT, VS                        vA, vB, vC, vD, vS
/, //, ///                                0...0 (shaded)

Table 6-4. Precedence Rules

Operators                                 Associativity
x[n], function evaluation                 Left to right
(n)x or replication, x^n or exponentiation  Right to left
unary -, ¬                                Right to left
*, ÷                                      Left to right
+, -                                      Left to right
||                                        Left to right
=, ≠, <, ≤, >, ≥, <U, >U, ?               Left to right
&, ⊕, ≡                                   Left to right
|                                         Left to right
- (range), : (range)                      None
←, ←iea                                   None

Precedence rules for pseudocode operators are summarized in Table 6-4. Operators higher in Table 6-4 are applied before those lower in the table. Operators at the same level in the table associate from left to right, from right to left, or not at all, as shown in the associativity column. For example, - (binary minus) associates from left to right, so a - b - c = (a - b) - c. Parentheses are used to override the evaluation order implied by
Table 6-4, or to increase clarity; parenthesized expressions are evaluated before serving as operands.

6.2 AltiVec Instruction Set

The remainder of this chapter lists and describes the instruction set for the AltiVec architecture. The instructions are listed in alphabetical order by mnemonic. The diagram below shows the format for each instruction description page.

vaddsbs
Vector Add Signed Byte Saturate

vaddsbs vD,vA,vB                Form VX

Bits    0-5    6-10    11-15    16-20    21-31
Field   04     vD      vA       vB       768

do i=0 to 127 by 8
    aop_0:8 ← signextend((vA)_i:i+7, 9)
    bop_0:8 ← signextend((vB)_i:i+7, 9)
    temp_0:8 ← aop_0:8 +int bop_0:8
    vD_i:i+7 ← sitosisat(temp_0:8, 8)
end

Each element of vaddsbs is a byte.

Each signed-integer element in vA is added to the corresponding signed-integer element in vB.

If the sum is greater than (2^7 - 1) it saturates to (2^7 - 1) and if it is less than -2^7 it saturates to -2^7. If saturation occurs, the SAT bit is set.

The signed-integer result is placed into the corresponding element of vD.

Other registers altered:
• Vector status and control register (VSCR):
  Affected: SAT

Figure 6-11 shows the usage of the vaddsbs instruction. Each of the sixteen elements in the vectors vA, vB, and vD is 8 bits long.

Figure 6-11. vaddsbs: Add Saturating Sixteen Signed Integer Elements (8-Bit)
(Sixteen element-wise additions of vA and vB, with results placed in vD.)

The parts of the sample page are labeled as follows:
• Instruction name
• Instruction syntax and form
• Instruction encoding in decimal
• Pseudocode description of instruction operation
• Text description of instruction operation
• Figure showing instruction usage
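The vaddsbs pseudocode on the sample page can be modeled element by element. The following is a minimal sketch, assuming each vector is represented as a list of sixteen raw byte values (0 to 255); the Python function names are illustrative stand-ins for the pseudocode functions, and the returned flag only reports whether this one instruction would set SAT.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def vaddsbs(va, vb):
    """Model vaddsbs on two lists of 16 byte values.

    Returns (vd, sat): the 16 result bytes and whether any element saturated.
    """
    if len(va) != 16 or len(vb) != 16:
        raise ValueError("each vector must hold 16 byte elements")
    vd = []
    sat = False
    for a, b in zip(va, vb):
        total = sign_extend(a, 8) + sign_extend(b, 8)  # exact 9-bit sum
        clamped = max(-128, min(127, total))           # saturate to 8 bits
        if clamped != total:
            sat = True
        vd.append(clamped & 0xFF)                      # store as a raw byte
    return vd, sat


# 127 + 1 saturates to 127, -128 + -1 saturates to -128, 1 + 2 = 3.
result, saturated = vaddsbs([0x7F, 0x80, 1] + [0] * 13,
                            [0x01, 0xFF, 2] + [0] * 13)
```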
dss
Data Stream Stop

dss STRM     (A=0)              Form X
dssall STRM  (A=1)

Bits    0-5    6    7-8    9-10    11-15     16-20     21-30    31
Field   31     A    0_0    STRM    0_0000    0000_0    822      0

datastreamprefetchcontrol ← "stop" || STRM

Note that A does not represent rA in this instruction.

If A=0 and a data stream associated with the stream ID specified by STRM exists, this instruction terminates prefetching of that data stream. It has no effect if the specified stream does not exist.

If A=1, this instruction terminates prefetching of all existing data streams (the STRM field is ignored).

In addition, executing a dss instruction ensures that all accesses associated with data stream prefetching caused by preceding dst and dstst instructions that specified the same stream ID as that specified by the dss instruction (A=0), or by all preceding dst and dstst instructions (A=1), will be in group G1 with respect to the memory barrier created by a subsequent sync instruction. Refer to Section 5.1, "PowerPC Shared Memory," for more information.

See Section 5.2.1, "Software-Directed Prefetch," for more information on using the dss instruction.

Other registers altered:
• None

Simplified mnemonics:
dss STRM      equivalent to    dss STRM,0
dssall        equivalent to    dss 0,1

For more information on the dss instruction, refer to Chapter 5, "Cache, Exceptions, and Memory Management."
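The dss encoding above can be assembled with shifts. This is an illustrative sketch that packs the fields exactly as laid out in the encoding diagram (primary opcode 31 in bits 0-5, A in bit 6, STRM in bits 9-10, extended opcode 822 in bits 21-30), assuming big-endian bit numbering within a 32-bit word; encode_dss and decode_dss are hypothetical helper names.

```python
DSS_PRIMARY = 31    # bits 0-5
DSS_EXTENDED = 822  # bits 21-30


def encode_dss(strm, all_streams=False):
    """Build the 32-bit instruction word for dss STRM (or dssall)."""
    if not 0 <= strm <= 3:
        raise ValueError("STRM must be in the range 0-3")
    a = 1 if all_streams else 0
    return (DSS_PRIMARY << 26) | (a << 25) | (strm << 21) | (DSS_EXTENDED << 1)


def decode_dss(word):
    """Return (A, STRM) if word is a dss instruction, otherwise None."""
    if (word >> 26) & 0x3F != DSS_PRIMARY or (word >> 1) & 0x3FF != DSS_EXTENDED:
        return None
    return (word >> 25) & 1, (word >> 21) & 3


dss0 = encode_dss(0)                       # dss 0
dssall = encode_dss(0, all_streams=True)   # dssall, that is, dss 0,1
```

The decoder does not check the reserved zero fields, so it would also accept words that set them; a stricter check may be wanted in a real disassembler.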
dst
Data Stream Touch

dst rA,rB,STRM     (T=0)    Form: X
dstt rA,rB,STRM    (T=1)

    addr[0:63] ← (rA)
    DataStreamPrefetchControl ← "start" || STRM || T || (rB) || addr

This instruction initiates a software-directed cache prefetch. The instruction is a hint to hardware that performance will probably be improved if the cache blocks containing the specified data stream are fetched into the data cache, because the program will probably soon load from the stream.

The instruction associates the data stream specified by the contents of rA and rB with the stream ID specified by STRM. The instruction defines a data stream STRM as starting at an effective address (rA) and having Count units of Size quad words separated by Stride bytes (as specified in rB). The T bit of the instruction indicates whether the data stream is likely to be loaded from fairly frequently in the near future (T = 0) or to be transient and referenced very few times (T = 1).

The dst instruction does the following:
  • Defines the characteristics of a data stream STRM by the contents of rA and rB
  • Associates the stream with a specified stream ID, STRM (range for STRM is 0-3)
  • Indicates that the data in the specified stream STRM starting at the address in rA may soon be loaded
  • Indicates whether memory locations within the stream are likely to be needed over a longer period of time (T=0) or be treated as transient data (T=1)
  • Terminates prefetching from any stream that was previously associated with the specified stream ID, STRM

[Figure: a memory stream drawn as blocks 0 through 5, labeled with the starting address, block size, block stride, and BlockAddr n (n=3)]

Encoding: bits 0-5 = 31, bit 6 = T, bits 7-8 = 0_0, bits 9-10 = STRM, bits 11-15 = A, bits 16-20 = B, bits 21-30 = 342, bit 31 = 0
The specified data stream is encoded as follows for 32-bit implementations:

[Figure: format of rB. Bits 0-2 = reserved (///), bits 3-7 = block size, bits 8-15 = block count, bits 16-31 = block stride]

  • Effective address: rA, where rA ≠ 0
  • Block size: rB[3-7] if rB[3-7] ≠ 0; otherwise 32
  • Block count: rB[8-15] if rB[8-15] ≠ 0; otherwise 256
  • Block stride: rB[16-31] if rB[16-31] ≠ 0; otherwise 32768

Other registers altered: None

Simplified mnemonics:
    dst rA,rB,STRM     equivalent to    dst rA,rB,STRM,0
    dstt rA,rB,STRM    equivalent to    dst rA,rB,STRM,1

For more information on the dst instruction, refer to Chapter 5, "Cache, Exceptions, and Memory Management."
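The rB layout listed above can be sketched in portable C. This is only an illustration of the field positions, not code from the manual: the names `dst_pack_control`, `dst_decode_control`, and `dst_stream` are hypothetical, the stride field is treated as a raw unsigned value, and only positive strides from 1 to 32768 are handled. PowerPC numbers bit 0 as the most significant bit, so rB[3-7] sits in bits 28-24 of a conventional 32-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: decoded view of the rB control word. */
typedef struct {
    unsigned size;    /* block size in quad words, 1..32 */
    unsigned count;   /* number of blocks, 1..256 */
    unsigned stride;  /* bytes between blocks, raw field or 32768 when 0 */
} dst_stream;

/* Pack size/count/stride into the 32-bit rB layout:
   rB[3-7] = size, rB[8-15] = count, rB[16-31] = stride.
   The maximum value of each field is encoded as 0. */
static uint32_t dst_pack_control(unsigned size, unsigned count, unsigned stride)
{
    assert(size >= 1 && size <= 32);
    assert(count >= 1 && count <= 256);
    assert(stride >= 1 && stride <= 32768);
    uint32_t stride_field = (stride == 32768u) ? 0u : stride;
    return ((uint32_t)(size & 0x1Fu) << 24)    /* 32 becomes 0 */
         | ((uint32_t)(count & 0xFFu) << 16)   /* 256 becomes 0 */
         | stride_field;
}

/* Decode the same layout, applying the "otherwise" defaults for zero fields. */
static dst_stream dst_decode_control(uint32_t rb)
{
    dst_stream s;
    unsigned size = (rb >> 24) & 0x1Fu;
    unsigned count = (rb >> 16) & 0xFFu;
    unsigned stride = rb & 0xFFFFu;
    s.size = size ? size : 32u;
    s.count = count ? count : 256u;
    s.stride = stride ? stride : 32768u;
    return s;
}
```

For example, a stream of 16 blocks of 2 quad words each, spaced 64 bytes apart, packs to 0x02100040 under these assumptions.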
dstst
Data Stream Touch for Store

dstst rA,rB,STRM     (T=0)    Form: X
dststt rA,rB,STRM    (T=1)

    addr[0:63] ← (rA)
    DataStreamPrefetchControl ← "start" || T || static || (rB) || addr

This instruction initiates a software-directed cache prefetch. The instruction is a hint to hardware that performance will probably be improved if the cache blocks containing the specified data stream are fetched into the data cache, because the program will probably soon write to (store into) the stream.

The instruction associates the data stream specified by the contents of rA and rB with the stream ID specified by STRM. The instruction defines a data stream STRM as starting at an effective address (rA) and having Count units of Size quad words separated by Stride bytes (as specified in rB). The T bit of the instruction indicates whether the data stream is likely to be stored into fairly frequently in the near future (T = 0) or to be transient and referenced very few times (T = 1).

The dstst instruction does the following:
  • Defines the characteristics of a data stream STRM by the contents of rA and rB
  • Associates the stream with a specified stream ID, STRM (range for STRM is 0-3)
  • Indicates that the data in the specified stream STRM starting at the address in rA may soon be stored into
  • Indicates whether memory locations within the stream are likely to be stored into fairly frequently in the near future (T=0) or be treated as transient data (T=1)
  • Terminates prefetching from any stream that was previously associated with the specified stream ID, STRM

[Figure: a memory stream drawn as blocks 0 through 5, labeled with the starting address, block size, block stride, and BlockAddr n (n=3)]

Encoding: bits 0-5 = 31, bit 6 = T, bits 7-8 = 0_0, bits 9-10 = STRM, bits 11-15 = A, bits 16-20 = B, bits 21-30 = 374, bit 31 = 0
The specified data stream is encoded as follows for 32-bit implementations:

Figure 6-1. Format of rB in dst Instruction (32-bit)
[Figure: bits 0-2 = reserved (///), bits 3-7 = block size, bits 8-15 = block count, bits 16-31 = block stride]

  • Effective address: rA, where rA ≠ 0
  • Block size: rB[3-7] if rB[3-7] ≠ 0; otherwise 32
  • Block count: rB[8-15] if rB[8-15] ≠ 0; otherwise 256
  • Block stride: rB[16-31] if rB[16-31] ≠ 0; otherwise 32768

Other registers altered: None

Simplified mnemonics:
    dstst rA,rB,STRM     equivalent to    dstst rA,rB,STRM,0
    dststt rA,rB,STRM    equivalent to    dstst rA,rB,STRM,1

For more information on the dstst instruction, refer to Chapter 5, "Cache, Exceptions, and Memory Management."
lvebx
Load Vector Element Byte Indexed

lvebx vD,rA,rB    Form: X

For 32-bit:
    if rA=0 then b ← 0
    else b ← (rA)
    EA ← b + (rB)
    eb ← EA[28:31]
    vD ← undefined
    if the processor is in big-endian mode
        then vD[eb*8:(eb*8)+7] ← MEM(EA,1)
        else vD[120-(eb*8):127-(eb*8)] ← MEM(EA,1)

Let the EA be the sum (rA|0)+(rB). Let m = EA[28-31]; m is the byte offset of the byte in its aligned quadword in memory. If the processor is in big-endian mode, the byte addressed by EA is loaded into byte m of vD. If the processor is in little-endian mode, the byte addressed by EA is loaded into byte (15-m) of vD. The remaining bytes in vD are set to undefined values.

Other registers altered: None

Encoding: bits 0-5 = 31, bits 6-10 = vD, bits 11-15 = A, bits 16-20 = B, bits 21-30 = 7, bit 31 = 0
Figure 6-2. Effects of Example Load/Store Instructions
[Figure: memory quadwords at addresses 0x0000_0000 through 0x0000_00B0 and the corresponding vector registers (VR) for four example loads or stores: a byte at 0x1E, a half word at 0x2A, a word at 0x54, and a quad word at 0xA0. All other byte positions are marked x.]

Note: In vector registers, x means boundedly undefined after a load and don't care after a store. In memory, x means don't care after a load, and leave at current value after a store.
lvehx
Load Vector Element Half Word Indexed

lvehx vD,rA,rB    Form: X

For 32-bit:
    if rA=0 then b ← 0
    else b ← (rA)
    EA ← (b + (rB)) & (~1)
    eb ← EA[28:31]
    vD ← undefined
    if the processor is in big-endian mode
        then vD[eb*8:(eb*8)+15] ← MEM(EA,2)
        else vD[112-(eb*8):127-(eb*8)] ← MEM(EA,2)

Let the EA be the result of ANDing the sum (rA|0)+(rB) with ~1. Let m = EA[28-30]; m is the half-word offset of the half word in its aligned quadword in memory. If the processor is in big-endian mode, the half word addressed by EA is loaded into half word m of vD. If the processor is in little-endian mode, the half word addressed by EA is loaded into half word (7-m) of vD. The remaining half words in vD are set to undefined values. Figure 6-2 shows this instruction.

Other registers altered: None

Encoding: bits 0-5 = 31, bits 6-10 = vD, bits 11-15 = A, bits 16-20 = B, bits 21-30 = 39, bit 31 = 0
lvewx
Load Vector Element Word Indexed

lvewx vD,rA,rB    Form: X

For 32-bit:
    if rA=0 then b ← 0
    else b ← (rA)
    EA ← (b + (rB)) & (~3)
    eb ← EA[28:31]
    vD ← undefined
    if the processor is in big-endian mode
        then vD[eb*8:(eb*8)+31] ← MEM(EA,4)
        else vD[96-(eb*8):127-(eb*8)] ← MEM(EA,4)

Let the EA be the result of ANDing the sum (rA|0)+(rB) with ~3. Let m = EA[28-29]; m is the word offset of the word in its aligned quadword in memory. If the processor is in big-endian mode, the word addressed by EA is loaded into word m of vD. If the processor is in little-endian mode, the word addressed by EA is loaded into word (3-m) of vD. The remaining words in vD are set to undefined values. Figure 6-2 shows this instruction.

Other registers altered: None

Encoding: bits 0-5 = 31, bits 6-10 = vD, bits 11-15 = A, bits 16-20 = B, bits 21-30 = 71, bit 31 = 0
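The element loads (lvebx, lvehx, lvewx) all follow the same pattern: align the EA to the element size, then pick the destination element from the EA's offset within its quadword. The sketch below models only that address arithmetic in plain C. The names `elem_slot` and `vec_element_slot` are hypothetical, `base` stands for the value of (rA|0), and no memory access is modeled.

```c
#include <stdint.h>

/* Hypothetical result type: aligned EA and destination element index. */
typedef struct {
    uint32_t ea;    /* effective address after alignment */
    unsigned lane;  /* element number within the vector register */
} elem_slot;

/* size is the element size in bytes: 1 (lvebx), 2 (lvehx) or 4 (lvewx). */
static elem_slot vec_element_slot(uint32_t base, uint32_t index,
                                  unsigned size, int little_endian)
{
    elem_slot s;
    unsigned lanes = 16u / size;                     /* 16, 8 or 4 elements */
    s.ea = (base + index) & ~(uint32_t)(size - 1u);  /* & ~0, ~1 or ~3 */
    unsigned m = (unsigned)(s.ea & 0xFu) / size;     /* offset m in quadword */
    s.lane = little_endian ? (lanes - 1u - m) : m;   /* (15-m), (7-m), (3-m) */
    return s;
}
```

With the example addresses from Figure 6-2, a byte at 0x1E selects byte 14 in big-endian mode, a half word at 0x2A selects half word 5, and a word at 0x54 selects word 1.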
lvsl
Load Vector for Shift Left

lvsl vD,rA,rB    Form: X

For 32-bit:
    if rA = 0 then b ← 0
    else b ← (rA)
    addr[0:31] ← b + (rB)
    sh ← addr[28:31]
    if sh = 0x0 then (vD)[0:127] ← 0x000102030405060708090A0B0C0D0E0F
    if sh = 0x1 then (vD)[0:127] ← 0x0102030405060708090A0B0C0D0E0F10
    if sh = 0x2 then (vD)[0:127] ← 0x02030405060708090A0B0C0D0E0F1011
    if sh = 0x3 then (vD)[0:127] ← 0x030405060708090A0B0C0D0E0F101112
    if sh = 0x4 then (vD)[0:127] ← 0x0405060708090A0B0C0D0E0F10111213
    if sh = 0x5 then (vD)[0:127] ← 0x05060708090A0B0C0D0E0F1011121314
    if sh = 0x6 then (vD)[0:127] ← 0x060708090A0B0C0D0E0F101112131415
    if sh = 0x7 then (vD)[0:127] ← 0x0708090A0B0C0D0E0F10111213141516
    if sh = 0x8 then (vD)[0:127] ← 0x08090A0B0C0D0E0F1011121314151617
    if sh = 0x9 then (vD)[0:127] ← 0x090A0B0C0D0E0F101112131415161718
    if sh = 0xA then (vD)[0:127] ← 0x0A0B0C0D0E0F10111213141516171819
    if sh = 0xB then (vD)[0:127] ← 0x0B0C0D0E0F101112131415161718191A
    if sh = 0xC then (vD)[0:127] ← 0x0C0D0E0F101112131415161718191A1B
    if sh = 0xD then (vD)[0:127] ← 0x0D0E0F101112131415161718191A1B1C
    if sh = 0xE then (vD)[0:127] ← 0x0E0F101112131415161718191A1B1C1D
    if sh = 0xF then (vD)[0:127] ← 0x0F101112131415161718191A1B1C1D1E

Let the EA be the sum (rA|0)+(rB). Let sh = EA[28-31]. Let X be the 32-byte value 0x00 || 0x01 || 0x02 || ... || 0x1E || 0x1F. Bytes sh:sh+15 of X are placed into vD. Figure 6-3 shows how this instruction works.

Other registers altered: None

Figure 6-3. Load Vector for Shift Left
[Figure: the contents of rA and rB (0x0000_0008 and 0x0000_0004) are added to give temp = 0x0000_000C; a table lookup on temp produces vD = 0C 0D 0E 0F 10 11 12 13 14 15 16 17 18 19 1A 1B]

Encoding: bits 0-5 = 31, bits 6-10 = vD, bits 11-15 = A, bits 16-20 = B, bits 21-30 = 6, bit 31 = 0
The above lvsl instruction followed by a vector permute (vperm) would do a simulated alignment of a four-element floating-point vector misaligned on a quad-word boundary at address 0x0....C.

Figure 6-4. Instruction vperm Used in Aligning Data
[Figure: the control vector vC = 0C 0D 0E 0F 10 11 12 13 14 15 16 17 18 19 1A 1B selects bytes 0x0C through 0x1B of the 32-byte concatenation of vA (bytes 00-0F) and vB (bytes 10-1F) to form vD]

Refer also to the description of the lvsr instruction for suggested uses of the lvsl instruction.
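The alignment idiom in Figure 6-4 can be modeled in portable C to check the byte selection by hand. This is a simulation under stated assumptions, not AltiVec code: `sim_lvsl`, `sim_vperm`, and `sim_load_unaligned` are hypothetical names, registers are modeled as 16-byte arrays in big-endian byte order, and vperm is modeled as using the low 5 bits of each control byte to index the 32-byte value vA || vB.

```c
#include <stdint.h>
#include <string.h>

/* Model of lvsl: control bytes sh, sh+1, ..., sh+15, where sh = EA[28-31]. */
static void sim_lvsl(uint32_t ea, uint8_t vd[16])
{
    unsigned sh = (unsigned)(ea & 0xFu);
    for (unsigned i = 0; i < 16; i++)
        vd[i] = (uint8_t)(sh + i);
}

/* Model of vperm: each control byte selects one byte of va || vb. */
static void sim_vperm(const uint8_t va[16], const uint8_t vb[16],
                      const uint8_t vc[16], uint8_t vd[16])
{
    uint8_t src[32];
    memcpy(src, va, 16);
    memcpy(src + 16, vb, 16);
    for (unsigned i = 0; i < 16; i++)
        vd[i] = src[vc[i] & 0x1Fu];
}

/* Fetch 16 bytes starting at a possibly misaligned address by reading the
   two aligned quadwords that contain them and permuting.
   The caller must make both quadwords readable, even when ea is aligned. */
static void sim_load_unaligned(const uint8_t *mem, uint32_t ea, uint8_t out[16])
{
    uint32_t lo = ea & ~(uint32_t)0xFu;
    uint8_t ctl[16];
    sim_lvsl(ea, ctl);
    sim_vperm(mem + lo, mem + lo + 16, ctl, out);
}
```

If the memory bytes equal their own addresses, a load from address 0x0C yields bytes 0x0C through 0x1B, matching the control vector shown in Figure 6-3.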
lvsr
Load Vector for Shift Right

lvsr vD,rA,rB    Form: X

For 32-bit:
    if rA = 0 then b ← 0
    else b ← (rA)
    EA ← b + (rB)
    sh ← EA[28:31]
    if sh=0x0 then vD ← 0x101112131415161718191A1B1C1D1E1F
    if sh=0x1 then vD ← 0x0F101112131415161718191A1B1C1D1E
    if sh=0x2 then vD ← 0x0E0F101112131415161718191A1B1C1D
    if sh=0x3 then vD ← 0x0D0E0F101112131415161718191A1B1C
    if sh=0x4 then vD ← 0x0C0D0E0F101112131415161718191A1B
    if sh=0x5 then vD ← 0x0B0C0D0E0F101112131415161718191A
    if sh=0x6 then vD ← 0x0A0B0C0D0E0F10111213141516171819
    if sh=0x7 then vD ← 0x090A0B0C0D0E0F101112131415161718
    if sh=0x8 then vD ← 0x08090A0B0C0D0E0F1011121314151617
    if sh=0x9 then vD ← 0x0708090A0B0C0D0E0F10111213141516
    if sh=0xA then vD ← 0x060708090A0B0C0D0E0F101112131415
    if sh=0xB then vD ← 0x05060708090A0B0C0D0E0F1011121314
    if sh=0xC then vD ← 0x0405060708090A0B0C0D0E0F10111213
    if sh=0xD then vD ← 0x030405060708090A0B0C0D0E0F101112
    if sh=0xE then vD ← 0x02030405060708090A0B0C0D0E0F1011
    if sh=0xF then vD ← 0x0102030405060708090A0B0C0D0E0F10

Let the EA be the sum (rA|0)+(rB). Let sh = EA[28-31]. Let X be the 32-byte value 0x00 || 0x01 || 0x02 || ... || 0x1E || 0x1F. Bytes (16-sh):(31-sh) of X are placed into vD.

Note that lvsl and lvsr can be used to create the permute control vector to be used by a subsequent vperm instruction. Let X and Y be the contents of vA and vB specified by the vperm. The control vector created by lvsl causes the vperm to select the high-order 16 bytes of the result of shifting the 32-byte value X || Y left by sh bytes. The control vector created by lvsr causes the vperm to select the low-order 16 bytes of the result of shifting X || Y right by sh bytes.

These instructions can also be used to rotate or shift the contents of a vector register by sh bytes. For rotating, the vector register to be rotated should be specified as both vA and vB for vperm.
For shifting left, the vB register for vperm should contain all zeros and vA should contain the value to be shifted, and vice versa for shifting right. Figure 6-3 shows a similar instruction, except that in that figure the shift is to the left.

Other registers altered: None

Encoding: bits 0-5 = 31, bits 6-10 = vD, bits 11-15 = A, bits 16-20 = B, bits 21-30 = 38, bit 31 = 0
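The rotate and shift-right uses of lvsr described above can be checked with a small portable C model. As before, this is a hedged sketch rather than AltiVec code: `sim_lvsr`, `sim_vperm`, `sim_rotate_right`, and `sim_shift_right` are hypothetical names, registers are 16-byte arrays in big-endian byte order, and vperm is modeled as indexing vA || vB with the low 5 bits of each control byte.

```c
#include <stdint.h>
#include <string.h>

/* Model of lvsr: control bytes (16-sh), (17-sh), ..., (31-sh). */
static void sim_lvsr(unsigned sh, uint8_t vd[16])
{
    for (unsigned i = 0; i < 16; i++)
        vd[i] = (uint8_t)(16u - (sh & 0xFu) + i);
}

/* Model of vperm: each control byte selects one byte of va || vb. */
static void sim_vperm(const uint8_t va[16], const uint8_t vb[16],
                      const uint8_t vc[16], uint8_t vd[16])
{
    uint8_t src[32];
    memcpy(src, va, 16);
    memcpy(src + 16, vb, 16);
    for (unsigned i = 0; i < 16; i++)
        vd[i] = src[vc[i] & 0x1Fu];
}

/* Rotate right by sh bytes: the same register supplies both vA and vB. */
static void sim_rotate_right(const uint8_t v[16], unsigned sh, uint8_t out[16])
{
    uint8_t ctl[16];
    sim_lvsr(sh, ctl);
    sim_vperm(v, v, ctl, out);
}

/* Shift right by sh bytes: vA holds zeros, vB holds the value. */
static void sim_shift_right(const uint8_t v[16], unsigned sh, uint8_t out[16])
{
    static const uint8_t zeros[16] = {0};
    uint8_t ctl[16];
    sim_lvsr(sh, ctl);
    sim_vperm(zeros, v, ctl, out);
}
```

With sh = 3, the rotate moves the last three bytes of the register to the front, while the shift fills the first three bytes with zeros and drops the last three.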
lvx
Load Vector Indexed

lvx vD,rA,rB    (LRU = 0)    Form: X

For 32-bit:
    if rA=0 then b ← 0
    else b ← (rA)
    EA ← (b + (rB)) & (~0xF)
    if the processor is in big-endian mode
        then vD ← MEM(EA,16)
        else vD ← MEM(EA+8,8) || MEM(EA,8)

Let the EA be the result of ANDing the sum (rA|0)+(rB) with ~0xF. If the processor is in big-endian mode, the quadword in memory addressed by EA is loaded into vD. If the processor is in little-endian mode, the doubleword addressed by EA is loaded into vD[64-127] and the doubleword addressed by EA+8 is loaded into vD[0-63]. Note that normal little-endian PowerPC address swizzling is also performed. See Section 3.1, "Data Organization in Memory," for more information. Figure 6-3 shows this instruction.

Other registers altered: None

Encoding: bits 0-5 = 31, bits 6-10 = vD, bits 11-15 = A, bits 16-20 = B, bits 21-30 = 103, bit 31 = 0
lvxl
Load Vector Indexed LRU

lvxl vD,rA,rB    (LRU = 1)    Form: X

For 32-bit:
    if rA=0 then b ← 0
    else b ← (rA)
    EA ← (b + (rB)) & (~0xF)
    if the processor is in big-endian mode
        then vD ← MEM(EA,16)
        else vD ← MEM(EA+8,8) || MEM(EA,8)

Let the EA be the result of ANDing the sum (rA|0)+(rB) with ~0xF. If the processor is in big-endian mode, the quadword addressed by EA is loaded into vD. If the processor is in little-endian mode, the doubleword addressed by EA is loaded into vD[64-127] and the doubleword addressed by EA+8 is loaded into vD[0-63]. Note that normal little-endian PowerPC address swizzling is also performed. See Section 3.1, "Data Organization in Memory," for more information.

lvxl provides a hint that the program may not need the quadword addressed by EA again soon.

Note that on some implementations, the hint provided by the lvxl instruction and the corresponding hint provided by the Store Vector Indexed LRU (stvxl) instruction (see Section 5.2.1.2, "Transient Streams") are applied to the entire cache block containing the specified quadword. On such implementations, the effect of the hint may be to cause that cache block to be considered a likely candidate for reuse when space is needed in the cache for a new block. Thus, on such implementations, the hint should be used with caution if the cache block containing the quadword also contains data that may be needed by the program in the near future. Also, the hint may be used before the last reference in a sequence of references to the quadword if the subsequent references are likely to occur sufficiently soon that the cache block containing the quadword is not likely to be displaced from the cache before the last reference. Figure 6-3 shows this instruction.
Other registers altered: None

Encoding: bits 0-5 = 31, bits 6-10 = vD, bits 11-15 = A, bits 16-20 = B, bits 21-30 = 359, bit 31 = 0
mfvscr
Move from Vector Status and Control Register

mfvscr vD    Form: VX

    vD ← (96)0 || (VSCR)

The contents of the VSCR are placed into the low-order 32 bits of vD; the high-order 96 bits of vD are set to zero.

Note that the programmer should assume that mtvscr and mfvscr take substantially longer to execute than other VX instructions.

Other registers altered: None

Encoding: bits 0-5 = 04, bits 6-10 = vD, bits 11-15 = 0_0000, bits 16-20 = 0000_0, bits 21-31 = 1540
mtvscr
Move to Vector Status and Control Register

mtvscr vB    Form: VX

    VSCR ← (vB)[96:127]

The contents of vB are placed into the VSCR.

Other registers altered: None

Encoding: bits 0-5 = 04, bits 6-10 = 00_000, bits 11-15 = 0_0000, bits 16-20 = vB, bits 21-31 = 1604
stvebx
Store Vector Element Byte Indexed

stvebx vS,rA,rB    Form: X

For 32-bit:
    if rA=0 then b ← 0
    else b ← (rA)
    EA ← b + (rB)
    eb ← EA[28:31]
    if the processor is in big-endian mode
        then MEM(EA,1) ← (vS)[eb*8:(eb*8)+7]
        else MEM(EA,1) ← (vS)[120-(eb*8):127-(eb*8)]

Let the EA be the sum (rA|0)+(rB). Let m = EA[28-31]; m is the byte offset of the byte in its aligned quadword in memory. If the processor is in big-endian mode, byte m of vS is stored into the byte in memory addressed by EA. If the processor is in little-endian mode, byte (15-m) of vS is stored into the byte addressed by EA. Figure 6-2 shows how a store instruction is performed for a vector register.

Other registers altered: None

Encoding: bits 0-5 = 31, bits 6-10 = vS, bits 11-15 = A, bits 16-20 = B, bits 21-30 = 135, bit 31 = 0
stvehx
Store Vector Element Half Word Indexed

stvehx vS,rA,rB    Form: X

For 32-bit:
    if rA=0 then b ← 0
    else b ← (rA)
    EA ← (b + (rB)) & (~0x1)
    eb ← EA[28:31]
    if the processor is in big-endian mode
        then MEM(EA,2) ← (vS)[eb*8:(eb*8)+15]
        else MEM(EA,2) ← (vS)[112-(eb*8):127-(eb*8)]

Let the EA be the result of ANDing the sum (rA|0)+(rB) with ~0x1. Let m = EA[28-30]; m is the half-word offset of the half word in its aligned quadword in memory. If the processor is in big-endian mode, half word m of vS is stored into the half word addressed by EA. If the processor is in little-endian mode, half word (7-m) of vS is stored into the half word addressed by EA. Figure 6-2 shows how a store instruction is performed for a vector register.

Other registers altered: None

Encoding: bits 0-5 = 31, bits 6-10 = vS, bits 11-15 = A, bits 16-20 = B, bits 21-30 = 167, bit 31 = 0
stvewx
Store Vector Element Word Indexed

stvewx vS,rA,rB    Form: X

For 32-bit:
    if rA=0 then b ← 0
    else b ← (rA)
    EA ← (b + (rB)) & 0xFFFF_FFFC
    eb ← EA[28:31]
    if the processor is in big-endian mode
        then MEM(EA,4) ← (vS)[eb*8:(eb*8)+31]
        else MEM(EA,4) ← (vS)[96-(eb*8):127-(eb*8)]

Let the EA be the result of ANDing the sum (rA|0)+(rB) with 0xFFFF_FFFC. Let m = EA[28-29]; m is the word offset of the word in its aligned quadword in memory. If the processor is in big-endian mode, word m of vS is stored into the word addressed by EA. If the processor is in little-endian mode, word (3-m) of vS is stored into the word addressed by EA. Figure 6-2 shows how a store instruction is performed for a vector register.

Other registers altered: None

Encoding: bits 0-5 = 31, bits 6-10 = vS, bits 11-15 = A, bits 16-20 = B, bits 21-30 = 199, bit 31 = 0
stvx
Store Vector Indexed

stvx vS,rA,rB    (LRU = 0)    Form: X

For 32-bit:
    if rA=0 then b ← 0
    else b ← (rA)
    EA ← (b + (rB)) & 0xFFFF_FFF0
    if the processor is in big-endian mode
        then MEM(EA,16) ← (vS)
        else MEM(EA,16) ← (vS)[64:127] || (vS)[0:63]

Let the EA be the result of ANDing the sum (rA|0)+(rB) with 0xFFFF_FFF0. If the processor is in big-endian mode, the contents of vS are stored into the quadword addressed by EA. If the processor is in little-endian mode, the contents of vS[64-127] are stored into the doubleword addressed by EA, and the contents of vS[0-63] are stored into the doubleword addressed by EA+8. stvxl and stvxlt provide a hint that the quadword addressed by EA will probably not be needed again by the program in the near future. Figure 6-2 shows how a store instruction is performed for a vector register.

Other registers altered: None

Encoding: bits 0-5 = 31, bits 6-10 = vS, bits 11-15 = A, bits 16-20 = B, bits 21-30 = 231, bit 31 = 0
stvxl
Store Vector Indexed LRU

stvxl vS,rA,rB    (LRU = 1)    Form: X

For 32-bit:
    if rA=0 then b ← 0
    else b ← (rA)
    EA ← (b + (rB)) & 0xFFFF_FFF0
    if the processor is in big-endian mode
        then MEM(EA,16) ← (vS)
        else MEM(EA,16) ← (vS)[64:127] || (vS)[0:63]

Let the EA be the result of ANDing the sum (rA|0)+(rB) with 0xFFFF_FFF0 (or with 0xFFFF_FFFF_FFFF_FFF0 for a 64-bit effective address). If the processor is in big-endian mode, the contents of vS are stored into the quadword addressed by EA. If the processor is in little-endian mode, the contents of vS[64-127] are stored into the doubleword addressed by EA, and the contents of vS[0-63] are stored into the doubleword addressed by EA+8. The stvxl and stvxlt instructions provide a hint that the quadword addressed by EA will probably not be needed again by the program in the near future.

Note that on some implementations, the hint provided by the stvxl instruction (see Section 5.2.2, "Prioritizing Cache Block Replacement") is applied to the entire cache block containing the specified quadword. On such implementations, the effect of the hint may be to cause that cache block to be considered a likely candidate for reuse when space is needed in the cache for a new block. Thus, on such implementations, the hint should be used with caution if the cache block containing the quadword also contains data that may be needed by the program in the near future. Also, the hint may be used before the last reference in a sequence of references to the quadword if the subsequent references are likely to occur sufficiently soon that the cache block containing the quadword is not likely to be displaced from the cache before the last reference. Figure 6-2 shows how a store instruction is performed on the vector registers.
Other registers altered: None

Encoding: bits 0-5 = 31, bits 6-10 = vS, bits 11-15 = A, bits 16-20 = B, bits 21-30 = 487, bit 31 = 0
vaddcuw
Vector Add Carryout Unsigned Word

vaddcuw vD,vA,vB    Form: VX

    do i=0 to 127 by 32
        aop[0:32]  ← ZeroExtend((vA)[i:i+31],33)
        bop[0:32]  ← ZeroExtend((vB)[i:i+31],33)
        temp[0:32] ← aop[0:32] +int bop[0:32]
        vD[i:i+31] ← ZeroExtend(temp[0],32)
    end

Each unsigned-integer word element in vA is added to the corresponding unsigned-integer word element in vB. The carry out of bit 0 of the 32-bit sum is zero-extended to 32 bits and placed into the corresponding word element of vD.

Other registers altered: None

Figure 6-5 shows the usage of the vaddcuw instruction. Each of the four elements in the vectors vA, vB, and vD is 32 bits long.

Figure 6-5. vaddcuw: Determine Carries of Four Unsigned Integer Adds (32-bit)
[Figure: four element-wise additions of vA and vB through a 33-bit intermediate, with the carry of each placed in vD]

Encoding: bits 0-5 = 04, bits 6-10 = vD, bits 11-15 = vA, bits 16-20 = vB, bits 21-31 = 384
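The 33-bit intermediate in the vaddcuw pseudocode can be modeled in portable C with a 64-bit sum whose bit 32 is the carry. This is a sketch for checking expected values, not AltiVec code; `sim_vaddcuw` is a hypothetical name and the vectors are modeled as arrays of four 32-bit words.

```c
#include <stdint.h>

/* Model of vaddcuw: vd[i] receives the carry out (0 or 1) of va[i] + vb[i]. */
static void sim_vaddcuw(const uint32_t va[4], const uint32_t vb[4],
                        uint32_t vd[4])
{
    for (unsigned i = 0; i < 4; i++) {
        uint64_t temp = (uint64_t)va[i] + (uint64_t)vb[i]; /* 33 bits used */
        vd[i] = (uint32_t)(temp >> 32);                    /* carry, zero-extended */
    }
}
```

A typical use of the carry vector is multi-word arithmetic, where the carries from one word position are added into the next higher position.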
vaddfp
Vector Add Floating Point

vaddfp vD,vA,vB    Form: VX

    do i = 0,127,32
        (vD)[i:i+31] ← RndToNearFP32((vA)[i:i+31] +fp (vB)[i:i+31])
    end

The four 32-bit floating-point values in vA are added to the four 32-bit floating-point values in vB. The four intermediate results are rounded and placed in vD.

If VSCR[NJ] = 1, every denormalized operand element is truncated to a 0 of the same sign before the operation is carried out, and each denormalized result element truncates to a 0 of the same sign.

Other registers altered: None

Figure 6-6 shows the usage of the vaddfp instruction. Each of the four elements in the vectors vA, vB, and vD is 32 bits long.

Figure 6-6. vaddfp: Add Four Floating-Point Elements (32-bit)
[Figure: four element-wise additions of vA and vB producing vD]

Encoding: bits 0-5 = 04, bits 6-10 = vD, bits 11-15 = vA, bits 16-20 = vB, bits 21-31 = 10
vaddsbs
Vector Add Signed Byte Saturate

vaddsbs vD,vA,vB    Form: VX

    do i=0 to 127 by 8
        aop[0:8]  ← SignExtend((vA)[i:i+7],9)
        bop[0:8]  ← SignExtend((vB)[i:i+7],9)
        temp[0:8] ← aop[0:8] +int bop[0:8]
        vD[i:i+7] ← SItoSIsat(temp[0:8],8)
    end

Each element of vaddsbs is a byte.

Each signed-integer element in vA is added to the corresponding signed-integer element in vB. If the sum is greater than (2^7 - 1) it saturates to (2^7 - 1), and if it is less than -2^7 it saturates to -2^7. If saturation occurs, the SAT bit is set. The signed-integer result is placed into the corresponding element of vD.

Other registers altered: Vector status and control register (VSCR). Affected: SAT

Figure 6-7 shows the usage of the vaddsbs instruction. Each of the sixteen elements in the vectors vA, vB, and vD is 8 bits long.

Figure 6-7. vaddsbs: Add Saturating Sixteen Signed Integer Elements (8-bit)
[Figure: sixteen element-wise additions of vA and vB producing vD]

Encoding: bits 0-5 = 04, bits 6-10 = vD, bits 11-15 = vA, bits 16-20 = vB, bits 21-31 = 768
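The saturating behavior of vaddsbs can be modeled in portable C for checking expected results. This is a minimal sketch, not AltiVec code: `sim_vaddsbs` is a hypothetical name, and the return value stands in for the SAT bit by reporting whether any element saturated (a caller modeling the VSCR would OR it into its own flag).

```c
#include <stdint.h>

/* Model of vaddsbs: element-wise signed byte add, clamped to [-2^7, 2^7 - 1].
   Returns 1 if any element saturated, otherwise 0. */
static int sim_vaddsbs(const int8_t va[16], const int8_t vb[16], int8_t vd[16])
{
    int sat = 0;
    for (unsigned i = 0; i < 16; i++) {
        int temp = (int)va[i] + (int)vb[i];   /* wide enough for the 9-bit sum */
        if (temp > INT8_MAX) {
            temp = INT8_MAX;
            sat = 1;
        } else if (temp < INT8_MIN) {
            temp = INT8_MIN;
            sat = 1;
        }
        vd[i] = (int8_t)temp;
    }
    return sat;
}
```

For example, 100 + 100 clamps to 127 and (-100) + (-100) clamps to -128, while 1 + 1 gives 2 without saturating.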
vaddshs
Vector Add Signed Half Word Saturate

vaddshs vD,vA,vB    Form: VX

    do i=0 to 127 by 16
        aop[0:16]  ← SignExtend((vA)[i:i+15],17)
        bop[0:16]  ← SignExtend((vB)[i:i+15],17)
        temp[0:16] ← aop[0:16] +int bop[0:16]
        vD[i:i+15] ← SItoSIsat(temp[0:16],16)
    end

Each element of vaddshs is a half word.

Each signed-integer element in vA is added to the corresponding signed-integer element in vB. If the sum is greater than (2^15 - 1) it saturates to (2^15 - 1), and if it is less than -2^15 it saturates to -2^15. If saturation occurs, the SAT bit is set. The signed-integer result is placed into the corresponding element of vD.

Other registers altered: Vector status and control register (VSCR). Affected: SAT

Figure 6-8 shows the usage of the vaddshs instruction. Each of the eight elements in the vectors vA, vB, and vD is 16 bits long.

Figure 6-8. vaddshs: Add Saturating Eight Signed Integer Elements (16-bit)
[Figure: eight element-wise additions of vA and vB producing vD]

Encoding: bits 0-5 = 04, bits 6-10 = vD, bits 11-15 = vA, bits 16-20 = vB, bits 21-31 = 832
vaddsws
Vector Add Signed Word Saturate

vaddsws vD,vA,vB    Form: VX

    do i=0 to 127 by 32
        aop[0:32]  ← SignExtend((vA)[i:i+31],33)
        bop[0:32]  ← SignExtend((vB)[i:i+31],33)
        temp[0:32] ← aop[0:32] +int bop[0:32]
        vD[i:i+31] ← SItoSIsat(temp[0:32],32)
    end

Each element of vaddsws is a word.

Each signed-integer element in vA is added to the corresponding signed-integer element in vB. If the sum is greater than (2^31 - 1) it saturates to (2^31 - 1), and if it is less than (-2^31) it saturates to (-2^31). If saturation occurs, the SAT bit is set. The signed-integer result is placed into the corresponding element of vD.

Other registers altered: Vector status and control register (VSCR). Affected: SAT

Figure 6-9 shows the usage of the vaddsws instruction. Each of the four elements in the vectors vA, vB, and vD is 32 bits long.

Figure 6-9. vaddsws: Add Saturating Four Signed Integer Elements (32-bit)
[Figure: four element-wise additions of vA and vB producing vD]

Encoding: bits 0-5 = 04, bits 6-10 = vD, bits 11-15 = vA, bits 16-20 = vB, bits 21-31 = 896
vaddubm
Vector Add Unsigned Byte Modulo

vaddubm vD,vA,vB                                Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 0 (21-31)

    do i=0 to 127 by 8
        (vD)[i:i+7] ← (vA)[i:i+7] +int (vB)[i:i+7]
    end

Each element of vaddubm is a byte. Each integer element in vA is modulo added to the corresponding integer element in vB. The integer result is placed into the corresponding element of vD.

Note that the vaddubm instruction can be used for unsigned or signed integers.

Other registers altered: None

Figure 6-10 shows the usage of the vaddubm instruction. Each of the sixteen elements in the vectors, vA, vB, and vD, is 8 bits long.

Figure 6-10. vaddubm: Add Sixteen Integer Elements (8-Bit)
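The remark that a modulo add serves both unsigned and signed operands follows from two's-complement arithmetic: the low 8 bits of the sum are the same under either interpretation. A minimal C sketch, with the hypothetical helper name `add_mod_u8` standing in for one byte lane:

```c
#include <stdint.h>

/* Hypothetical model of one vaddubm lane: keep only the low 8 bits of the sum. */
uint8_t add_mod_u8(uint8_t a, uint8_t b)
{
    /* The operands are promoted to int, added, then truncated modulo 2^8. */
    return (uint8_t)(a + b);
}
```

Read as unsigned values, 200 + 100 wraps to 44. Read as signed values, the bit patterns for -3 and 5 produce the bit pattern for 2, so the same lane operation covers both cases.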
vaddubs
Vector Add Unsigned Byte Saturate

vaddubs vD,vA,vB                                Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 512 (21-31)

    do i=0 to 127 by 8
        aop[0:8] ← ZeroExtend((vA)[i:i+7],9)
        bop[0:8] ← ZeroExtend((vB)[i:i+7],9)
        temp[0:8] ← aop[0:8] +int bop[0:8]
        (vD)[i:i+7] ← UItoUIsat(temp[0:8],8)
    end

Each element of vaddubs is a byte. Each unsigned-integer element in vA is added to the corresponding unsigned-integer element in vB. If the sum is greater than (2^8 - 1) it saturates to (2^8 - 1) and the SAT bit is set. The unsigned-integer result is placed into the corresponding element of vD.

Other registers altered:
    Vector status and control register (VSCR): Affected: SAT

Figure 6-11 shows the usage of the vaddubs instruction. Each of the sixteen elements in the vectors, vA, vB, and vD, is 8 bits long.

Figure 6-11. vaddubs: Add Saturating Sixteen Unsigned Integer Elements (8-Bit)
vadduhm
Vector Add Unsigned Half Word Modulo

vadduhm vD,vA,vB                                Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 64 (21-31)

    do i=0 to 127 by 16
        (vD)[i:i+15] ← (vA)[i:i+15] +int (vB)[i:i+15]
    end

Each element of vadduhm is a half word. Each integer element in vA is modulo added to the corresponding integer element in vB. The integer result is placed into the corresponding element of vD.

Note that the vadduhm instruction can be used for unsigned or signed integers.

Other registers altered: None

Figure 6-12 shows the usage of the vadduhm instruction. Each of the eight elements in the vectors, vA, vB, and vD, is 16 bits long.

Figure 6-12. vadduhm: Add Eight Integer Elements (16-Bit)
vadduhs
Vector Add Unsigned Half Word Saturate

vadduhs vD,vA,vB                                Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 576 (21-31)

    do i=0 to 127 by 16
        aop[0:16] ← ZeroExtend((vA)[i:i+15],17)
        bop[0:16] ← ZeroExtend((vB)[i:i+15],17)
        temp[0:16] ← aop[0:16] +int bop[0:16]
        (vD)[i:i+15] ← UItoUIsat(temp[0:16],16)
    end

Each element of vadduhs is a half word. Each unsigned-integer element in vA is added to the corresponding unsigned-integer element in vB. If the sum is greater than (2^16 - 1) it saturates to (2^16 - 1) and the SAT bit is set. The unsigned-integer result is placed into the corresponding element of vD.

Other registers altered:
    Vector status and control register (VSCR): Affected: SAT

Figure 6-13 shows the usage of the vadduhs instruction. Each of the eight elements in the vectors, vA, vB, and vD, is 16 bits long.

Figure 6-13. vadduhs: Add Saturating Eight Unsigned Integer Elements (16-Bit)
vadduwm
Vector Add Unsigned Word Modulo

vadduwm vD,vA,vB                                Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 128 (21-31)

    do i=0 to 127 by 32
        (vD)[i:i+31] ← (vA)[i:i+31] +int (vB)[i:i+31]
    end

Each element of vadduwm is a word. Each integer element in vA is modulo added to the corresponding integer element in vB. The integer result is placed into the corresponding element of vD.

Note that the vadduwm instruction can be used for unsigned or signed integers.

Other registers altered: None

Figure 6-14 shows the usage of the vadduwm instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-14. vadduwm: Add Four Integer Elements (32-Bit)
vadduws
Vector Add Unsigned Word Saturate

vadduws vD,vA,vB                                Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 640 (21-31)

    do i=0 to 127 by 32
        aop[0:32] ← ZeroExtend((vA)[i:i+31],33)
        bop[0:32] ← ZeroExtend((vB)[i:i+31],33)
        temp[0:32] ← aop[0:32] +int bop[0:32]
        (vD)[i:i+31] ← UItoUIsat(temp[0:32],32)
    end

Each element of vadduws is a word. Each unsigned-integer element in vA is added to the corresponding unsigned-integer element in vB. If the sum is greater than (2^32 - 1) it saturates to (2^32 - 1) and the SAT bit is set. The unsigned-integer result is placed into the corresponding element of vD.

Other registers altered:
    Vector status and control register (VSCR): Affected: SAT

Figure 6-15 shows the usage of the vadduws instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-15. vadduws: Add Saturating Four Unsigned Integer Elements (32-Bit)
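The 33-bit intermediate in the vadduws pseudocode exists so that the carry out of a 32-bit add is not lost before the saturation check. A scalar C sketch of one lane follows; the name `add_sat_u32` is hypothetical and a 64-bit integer stands in for the 33-bit temporary.

```c
#include <stdint.h>

/* Hypothetical model of one vadduws lane (not an AltiVec intrinsic). */
uint32_t add_sat_u32(uint32_t a, uint32_t b, int *sat)
{
    /* Zero-extend both operands so the 33-bit sum is represented exactly. */
    uint64_t temp = (uint64_t)a + (uint64_t)b;

    if (temp > UINT32_MAX) {   /* sum exceeds 2^32 - 1 */
        *sat = 1;              /* models setting the SAT bit */
        return UINT32_MAX;
    }
    return (uint32_t)temp;
}
```

Unsigned saturation only has an upper bound, which is why the description mentions a single clamp value rather than the two used by the signed forms.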
vand
Vector Logical AND

vand vD,vA,vB                                   Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1028 (21-31)

    vD ← (vA) & (vB)

The contents of vA are bitwise ANDed with the contents of vB and the result is placed into vD.

Other registers altered: None

Figure 6-16 shows the usage of the vand instruction.

Figure 6-16. vand: Logical Bitwise AND
vandc
Vector Logical AND with Complement

vandc vD,vA,vB                                  Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1092 (21-31)

    vD ← (vA) & ¬(vB)

The contents of vA are ANDed with the one's complement of the contents of vB and the result is placed into vD.

Other registers altered: None

Figure 6-17 shows the usage of the vandc instruction.

Figure 6-17. vandc: Logical Bitwise AND with Complement
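As a rough illustration of vandc, the 128-bit register can be modeled in C as four 32-bit words. The type `vec128` and the function `vandc_model` are hypothetical names introduced only for this sketch.

```c
#include <stdint.h>

/* Hypothetical 128-bit register model: four 32-bit words. */
typedef struct { uint32_t w[4]; } vec128;

/* vD = vA AND (NOT vB), applied bitwise across all 128 bits. */
vec128 vandc_model(vec128 a, vec128 b)
{
    vec128 d;
    for (int i = 0; i < 4; i++)
        d.w[i] = a.w[i] & ~b.w[i];
    return d;
}
```

A typical use is clearing the bits of vA that are selected by a mask in vB, for example a mask produced by one of the vector compare instructions.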
vavgsb
Vector Average Signed Byte

vavgsb vD,vA,vB                                 Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1282 (21-31)

    do i=0 to 127 by 8
        aop[0:8] ← SignExtend((vA)[i:i+7],9)
        bop[0:8] ← SignExtend((vB)[i:i+7],9)
        temp[0:8] ← aop[0:8] +int bop[0:8] +int 1
        (vD)[i:i+7] ← temp[0:7]
    end

Each element of vavgsb is a byte. Each signed-integer byte element in vA is added to the corresponding signed-integer byte element in vB, producing a 9-bit signed-integer sum. The sum is incremented by 1. The high-order 8 bits of the result are placed into the corresponding element of vD.

Other registers altered: None

Figure 6-18 shows the usage of the vavgsb instruction. Each of the sixteen elements in the vectors, vA, vB, and vD, is 8 bits long.

Figure 6-18. vavgsb: Average Sixteen Signed Integer Elements (8-Bit)
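Taking the high-order 8 bits of the 9-bit incremented sum amounts to computing floor((a + b + 1) / 2), so an average that falls halfway between two integers rounds toward positive infinity. The following C sketch models one lane; the name `avg_s8` is hypothetical, and the floor is written out explicitly rather than relying on a right shift of a negative value.

```c
#include <stdint.h>

/* Hypothetical model of one vavgsb lane (not an AltiVec intrinsic). */
int8_t avg_s8(int8_t a, int8_t b)
{
    /* 9-bit signed sum plus the rounding increment; int is wide enough. */
    int temp = (int)a + (int)b + 1;

    /* Dropping the low-order bit of a two's-complement value is
       floor(temp / 2). C integer division truncates toward zero,
       so negative values are handled separately. */
    int half = (temp >= 0) ? temp / 2 : -((-temp + 1) / 2);
    return (int8_t)half;
}
```

For example, the average of 1 and 2 is 2, and the average of -3 and -2 is -2; in both cases the exact value ends in .5 and the result is the next integer upward.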
vavgsh
Vector Average Signed Half Word

vavgsh vD,vA,vB                                 Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1346 (21-31)

    do i=0 to 127 by 16
        aop[0:16] ← SignExtend((vA)[i:i+15],17)
        bop[0:16] ← SignExtend((vB)[i:i+15],17)
        temp[0:16] ← aop[0:16] +int bop[0:16] +int 1
        (vD)[i:i+15] ← temp[0:15]
    end

Each element of vavgsh is a half word. Each signed-integer element in vA is added to the corresponding signed-integer element in vB, producing a 17-bit signed-integer sum. The sum is incremented by 1. The high-order 16 bits of the result are placed into the corresponding element of vD.

Other registers altered: None

Figure 6-19 shows the usage of the vavgsh instruction. Each of the eight elements in the vectors, vA, vB, and vD, is 16 bits long.

Figure 6-19. vavgsh: Average Eight Signed Integer Elements (16-Bit)
vavgsw
Vector Average Signed Word

vavgsw vD,vA,vB                                 Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1410 (21-31)

    do i=0 to 127 by 32
        aop[0:32] ← SignExtend((vA)[i:i+31],33)
        bop[0:32] ← SignExtend((vB)[i:i+31],33)
        temp[0:32] ← aop[0:32] +int bop[0:32] +int 1
        (vD)[i:i+31] ← temp[0:31]
    end

Each element of vavgsw is a word. Each signed-integer element in vA is added to the corresponding signed-integer element in vB, producing a 33-bit signed-integer sum. The sum is incremented by 1. The high-order 32 bits of the result are placed into the corresponding element of vD.

Other registers altered: None

Figure 6-20 shows the usage of the vavgsw instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-20. vavgsw: Average Four Signed Integer Elements (32-Bit)
vavgub
Vector Average Unsigned Byte

vavgub vD,vA,vB                                 Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1026 (21-31)

    do i=0 to 127 by 8
        aop[0:8] ← ZeroExtend((vA)[i:i+7],9)
        bop[0:8] ← ZeroExtend((vB)[i:i+7],9)
        temp[0:8] ← aop[0:8] +int bop[0:8] +int 1
        (vD)[i:i+7] ← temp[0:7]
    end

Each element of vavgub is a byte. Each unsigned-integer element in vA is added to the corresponding unsigned-integer element in vB, producing a 9-bit unsigned-integer sum. The sum is incremented by 1. The high-order 8 bits of the result are placed into the corresponding element of vD.

Other registers altered: None

Figure 6-21 shows the usage of the vavgub instruction. Each of the sixteen elements in the vectors, vA, vB, and vD, is 8 bits long.

Figure 6-21. vavgub: Average Sixteen Unsigned Integer Elements (8-Bit)
vavguh
Vector Average Unsigned Half Word

vavguh vD,vA,vB                                 Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1090 (21-31)

    do i=0 to 127 by 16
        aop[0:16] ← ZeroExtend((vA)[i:i+15],17)
        bop[0:16] ← ZeroExtend((vB)[i:i+15],17)
        temp[0:16] ← aop[0:16] +int bop[0:16] +int 1
        (vD)[i:i+15] ← temp[0:15]
    end

Each element of vavguh is a half word. Each unsigned-integer element in vA is added to the corresponding unsigned-integer element in vB, producing a 17-bit unsigned-integer sum. The sum is incremented by 1. The high-order 16 bits of the result are placed into the corresponding element of vD.

Other registers altered: None

Figure 6-22 shows the usage of the vavguh instruction. Each of the eight elements in the vectors, vA, vB, and vD, is 16 bits long.

Figure 6-22. vavguh: Average Eight Unsigned Integer Elements (16-Bit)
vavguw
Vector Average Unsigned Word

vavguw vD,vA,vB                                 Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1154 (21-31)

    do i=0 to 127 by 32
        aop[0:32] ← ZeroExtend((vA)[i:i+31],33)
        bop[0:32] ← ZeroExtend((vB)[i:i+31],33)
        temp[0:32] ← aop[0:32] +int bop[0:32] +int 1
        (vD)[i:i+31] ← temp[0:31]
    end

Each element of vavguw is a word. Each unsigned-integer element in vA is added to the corresponding unsigned-integer element in vB, producing a 33-bit unsigned-integer sum. The sum is incremented by 1. The high-order 32 bits of the result are placed into the corresponding element of vD.

Other registers altered: None

Figure 6-23 shows the usage of the vavguw instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-23. vavguw: Average Four Unsigned Integer Elements (32-Bit)
vcfsx
Vector Convert from Signed Fixed-Point Word

vcfsx vD,vB,UIMM                                Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | UIMM (11-15) | vB (16-20) | 842 (21-31)

    do i=0 to 127 by 32
        (vD)[i:i+31] ← CnvtSI32ToFP32((vB)[i:i+31]) ÷fp 2^UIMM
    end

Each signed fixed-point integer word element in vB is converted to the nearest single-precision floating-point value. The result is divided by 2^UIMM (UIMM = unsigned immediate value) and placed into the corresponding word element of vD.

Other registers altered: None

Figure 6-24 shows the usage of the vcfsx instruction. Each of the four elements in the vectors vB and vD is 32 bits long. The scale factor (2^UIMM) is taken from the UIMM field of the opcode.

Figure 6-24. vcfsx: Convert Four Signed Integer Elements to Four Floating-Point Elements (32-Bit)
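The convert-and-scale step of vcfsx treats each word as a fixed-point number with UIMM fractional bits. A scalar C sketch of one lane is shown below. The name `cfsx_lane` is hypothetical, and the sketch assumes the host converts `int32_t` to `float` with round-to-nearest, which is the default on typical IEEE 754 platforms.

```c
#include <stdint.h>

/* Hypothetical model of one vcfsx lane (not an AltiVec intrinsic). */
float cfsx_lane(int32_t b, unsigned uimm)
{
    /* UIMM occupies a 5-bit instruction field, so only 0..31 is meaningful. */
    uimm &= 31u;

    /* Convert to the nearest single-precision value, then divide by 2^UIMM.
       2^UIMM is exactly representable in float, and dividing by a power of
       two in this range introduces no additional rounding. */
    return (float)b / (float)(1u << uimm);
}
```

With UIMM = 8, for instance, the integer -256 is interpreted as the fixed-point value -1.0, and with UIMM = 0 the instruction is a plain integer-to-float conversion.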
vcfux
Vector Convert from Unsigned Fixed-Point Word

vcfux vD,vB,UIMM                                Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | UIMM (11-15) | vB (16-20) | 778 (21-31)

    do i=0 to 127 by 32
        (vD)[i:i+31] ← CnvtUI32ToFP32((vB)[i:i+31]) ÷fp 2^UIMM
    end

Each unsigned fixed-point integer word element in vB is converted to the nearest single-precision floating-point value. The result is divided by 2^UIMM and placed into the corresponding word element of vD.

Other registers altered: None

Figure 6-25 shows the usage of the vcfux instruction. Each of the four elements in the vectors vB and vD is 32 bits long. The scale factor (2^UIMM) is taken from the UIMM field of the opcode.

Figure 6-25. vcfux: Convert Four Unsigned Integer Elements to Four Floating-Point Elements (32-Bit)
vcmpbfpx
Vector Compare Bounds Floating Point

vcmpbfp  vD,vA,vB (Rc = 0)                      Form: VXR
vcmpbfp. vD,vA,vB (Rc = 1)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | Rc (21) | 966 (22-31)

    do i=0 to 127 by 32
        le ← ((vA)[i:i+31] ≤fp (vB)[i:i+31])
        ge ← ((vA)[i:i+31] ≥fp -(vB)[i:i+31])
        (vD)[i:i+31] ← ¬le || ¬ge || ³⁰0
    end
    if Rc=1 then do
        ib ← ((vD) = ¹²⁸0)
        CR[24:27] ← 0b00 || ib || 0b0
    end

Each single-precision word element in vA is compared to the corresponding element in vB. A 2-bit value is formed that indicates whether the element in vA is within the bounds specified by the element in vB, as follows.

Bit 0 of the 2-bit value is zero if the element in vA is less than or equal to the element in vB, and is one otherwise. Bit 1 of the 2-bit value is zero if the element in vA is greater than or equal to the negative of the element in vB, and is one otherwise.

The 2-bit value is placed into the high-order two bits of the corresponding word element of vD (bits 0-1 for word element 0, bits 32-33 for word element 1, bits 64-65 for word element 2, bits 96-97 for word element 3) and the remaining bits of the element are cleared.

If Rc = 1, CR field 6 is set to indicate whether all four elements in vA are within the bounds specified by the corresponding elements in vB, as follows.

    CR6 = 0b00 || all_within_bounds || 0b0

Note that if any single-precision floating-point word element in vB is negative, the corresponding element in vA is out of bounds. Note also that if a vA or a vB element is a NaN, the two high-order bits of the corresponding result will both have the value 1.

If VSCR[NJ] = 1, every denormalized operand element is truncated to 0 before the comparison is made.

Other registers altered:
    Condition register (CR6): Affected: Bit 2 (if Rc = 1)

Figure 6-26 shows the usage of the vcmpbfp instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-26. vcmpbfp: Compare Bounds of Four Floating-Point Elements (32-Bit)
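The bounds test can be sketched in scalar C as follows. This is an illustrative model only: `cmpb_lane` and `vcmpbfp_model` are hypothetical names, bit 0 of each word is modeled as the most significant bit of a `uint32_t`, the CR6 field is returned as a 4-bit number whose most significant bit corresponds to CR bit 24, and the VSCR[NJ] denormal handling is not modeled.

```c
#include <stdint.h>

/* Hypothetical model of one vcmpbfp lane. */
uint32_t cmpb_lane(float a, float b)
{
    /* Both comparisons are false when either operand is a NaN,
       which yields the documented result of two 1 bits. */
    int le = (a <= b);
    int ge = (a >= -b);

    /* Bit 0 (MSB) = NOT le, bit 1 = NOT ge, remaining 30 bits are zero. */
    return ((uint32_t)!le << 31) | ((uint32_t)!ge << 30);
}

/* Model of the four-lane instruction. Returns the CR6 value that the
   Rc = 1 form would produce: 0b00 || all_within_bounds || 0b0. */
unsigned vcmpbfp_model(uint32_t d[4], const float a[4], const float b[4])
{
    int all_in = 1;
    for (int i = 0; i < 4; i++) {
        d[i] = cmpb_lane(a[i], b[i]);
        if (d[i] != 0)
            all_in = 0;   /* at least one element is out of bounds */
    }
    return all_in ? 0x2u : 0x0u;
}
```

A caller clipping values to the range [-b, b] would typically test only the all_within_bounds bit and skip the clipping work when it is set.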
vcmpeqfpx
Vector Compare Equal-to Floating Point

vcmpeqfp  vD,vA,vB (Rc = 0)                     Form: VXR
vcmpeqfp. vD,vA,vB (Rc = 1)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | Rc (21) | 198 (22-31)

    do i=0 to 127 by 32
        if (vA)[i:i+31] =fp (vB)[i:i+31]
            then (vD)[i:i+31] ← 0xFFFF_FFFF
            else (vD)[i:i+31] ← 0x0000_0000
    end
    if Rc=1 then do
        t ← ((vD) = ¹²⁸1)
        f ← ((vD) = ¹²⁸0)
        CR[24:27] ← t || 0b0 || f || 0b0
    end

Each single-precision floating-point word element in vA is compared to the corresponding single-precision floating-point word element in vB. The corresponding word element in vD is set to all 1s if the element in vA is equal to the element in vB, and is cleared to all 0s otherwise.

If Rc = 1, the CR6 field is set according to whether all, some, or none of the element pairs compare equal.

    CR6 = all_equal || 0b0 || none_equal || 0b0

Note that if a vA or vB element is a NaN, the corresponding result will be 0x0000_0000.

Other registers altered:
    Condition register (CR6): Affected: Bits 0-3 (if Rc = 1)

Figure 6-27 shows the usage of the vcmpeqfp instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-27. vcmpeqfp: Compare Equal of Four Floating-Point Elements (32-Bit)
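The way the all/none summary in CR6 relates to the per-element masks can be seen in a small scalar model. In the C sketch below, `vcmpeqfp_model` is a hypothetical name, and the CR6 field is returned as a 4-bit number whose most significant bit corresponds to CR bit 24.

```c
#include <stdint.h>

/* Hypothetical model of vcmpeqfp. (Rc = 1): writes the four result masks
   to d and returns CR6 = all_equal || 0b0 || none_equal || 0b0. */
unsigned vcmpeqfp_model(uint32_t d[4], const float a[4], const float b[4])
{
    int all = 1, none = 1;

    for (int i = 0; i < 4; i++) {
        /* The == operator is false for NaN operands, matching the note
           that a NaN element produces 0x0000_0000. */
        d[i] = (a[i] == b[i]) ? 0xFFFFFFFFu : 0x00000000u;
        if (d[i])
            none = 0;   /* at least one pair compared equal */
        else
            all = 0;    /* at least one pair compared unequal */
    }
    return ((unsigned)all << 3) | ((unsigned)none << 1);
}
```

The "some" case is the one in which neither bit is set, so the returned value is 0x8 when every pair is equal, 0x2 when no pair is equal, and 0x0 for a mixture.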
vcmpequbx
Vector Compare Equal-to Unsigned Byte

vcmpequb  vD,vA,vB (Rc = 0)                     Form: VXR
vcmpequb. vD,vA,vB (Rc = 1)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | Rc (21) | 6 (22-31)

    do i=0 to 127 by 8
        if (vA)[i:i+7] =int (vB)[i:i+7]
            then (vD)[i:i+7] ← ⁸1
            else (vD)[i:i+7] ← ⁸0
    end
    if Rc=1 then do
        t ← ((vD) = ¹²⁸1)
        f ← ((vD) = ¹²⁸0)
        CR[24:27] ← t || 0b0 || f || 0b0
    end

Each element of vcmpequb is a byte. Each integer element in vA is compared to the corresponding integer element in vB. The corresponding element in vD is set to all 1s if the element in vA is equal to the element in vB, and is cleared to all 0s otherwise.

If Rc = 1, the CR6 field is set according to whether all, some, or none of the elements compare equal.

    CR6 = all_equal || 0b0 || none_equal || 0b0

Note that vcmpequb[.] can be used for unsigned or signed integers.

Other registers altered:
    Condition register (CR6): Affected: Bits 0-3 (if Rc = 1)

Figure 6-28 shows the usage of the vcmpequb instruction. Each of the sixteen elements in the vectors, vA, vB, and vD, is 8 bits long.

Figure 6-28. vcmpequb: Compare Equal of Sixteen Integer Elements (8-Bit)
vcmpequhx
Vector Compare Equal-to Unsigned Half Word

vcmpequh  vD,vA,vB (Rc = 0)                     Form: VXR
vcmpequh. vD,vA,vB (Rc = 1)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | Rc (21) | 70 (22-31)

    do i=0 to 127 by 16
        if (vA)[i:i+15] =int (vB)[i:i+15]
            then (vD)[i:i+15] ← ¹⁶1
            else (vD)[i:i+15] ← ¹⁶0
    end
    if Rc=1 then do
        t ← ((vD) = ¹²⁸1)
        f ← ((vD) = ¹²⁸0)
        CR[24:27] ← t || 0b0 || f || 0b0
    end

Each element of vcmpequh is a half word. Each integer element in vA is compared to the corresponding integer element in vB. The corresponding element in vD is set to all 1s if the element in vA is equal to the element in vB, and is cleared to all 0s otherwise.

If Rc = 1, the CR6 field is set according to whether all, some, or none of the elements compare equal.

    CR6 = all_equal || 0b0 || none_equal || 0b0

Note that vcmpequh[.] can be used for unsigned or signed integers.

Other registers altered:
    Condition register (CR6): Affected: Bits 0-3 (if Rc = 1)

Figure 6-29 shows the usage of the vcmpequh instruction. Each of the eight elements in the vectors, vA, vB, and vD, is 16 bits long.

Figure 6-29. vcmpequh: Compare Equal of Eight Integer Elements (16-Bit)
vcmpequwx
Vector Compare Equal-to Unsigned Word

vcmpequw  vD,vA,vB (Rc = 0)                     Form: VXR
vcmpequw. vD,vA,vB (Rc = 1)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | Rc (21) | 134 (22-31)

    do i=0 to 127 by 32
        if (vA)[i:i+31] =int (vB)[i:i+31]
            then (vD)[i:i+31] ← ³²1
            else (vD)[i:i+31] ← ³²0
    end
    if Rc=1 then do
        t ← ((vD) = ¹²⁸1)
        f ← ((vD) = ¹²⁸0)
        CR[24:27] ← t || 0b0 || f || 0b0
    end

Each element of vcmpequw is a word. Each integer element in vA is compared to the corresponding integer element in vB. The corresponding element in vD is set to all 1s if the element in vA is equal to the element in vB, and is cleared to all 0s otherwise.

If Rc = 1, the CR6 field is set according to whether all, some, or none of the elements compare equal.

    CR6 = all_equal || 0b0 || none_equal || 0b0

Note that vcmpequw[.] can be used for unsigned or signed integers.

Other registers altered:
    Condition register (CR6): Affected: Bits 0-3 (if Rc = 1)

Figure 6-30 shows the usage of the vcmpequw instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-30. vcmpequw: Compare Equal of Four Integer Elements (32-Bit)
vcmpgefpx
Vector Compare Greater-Than-or-Equal-to Floating Point

vcmpgefp  vD,vA,vB (Rc = 0)                     Form: VXR
vcmpgefp. vD,vA,vB (Rc = 1)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | Rc (21) | 454 (22-31)

    do i=0 to 127 by 32
        if (vA)[i:i+31] ≥fp (vB)[i:i+31]
            then (vD)[i:i+31] ← 0xFFFF_FFFF
            else (vD)[i:i+31] ← 0x0000_0000
    end
    if Rc=1 then do
        t ← ((vD) = ¹²⁸1)
        f ← ((vD) = ¹²⁸0)
        CR[24:27] ← t || 0b0 || f || 0b0
    end

Each single-precision floating-point word element in vA is compared to the corresponding single-precision floating-point word element in vB. The corresponding word element in vD is set to all 1s if the element in vA is greater than or equal to the element in vB, and is cleared to all 0s otherwise.

If Rc = 1, the CR6 field is set according to whether all, some, or none of the element pairs compare greater than or equal.

    CR6 = all_greater_or_equal || 0b0 || none_greater_or_equal || 0b0

Note that if a vA or vB element is a NaN, the corresponding result will be 0x0000_0000.

Other registers altered:
    Condition register (CR6): Affected: Bits 0-3 (if Rc = 1)

Figure 6-31 shows the usage of the vcmpgefp instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-31. vcmpgefp: Compare Greater-Than-or-Equal of Four Floating-Point Elements (32-Bit)
vcmpgtfpx
Vector Compare Greater-Than Floating Point

vcmpgtfp  vD,vA,vB (Rc = 0)                     Form: VXR
vcmpgtfp. vD,vA,vB (Rc = 1)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | Rc (21) | 710 (22-31)

    do i=0 to 127 by 32
        if (vA)[i:i+31] >fp (vB)[i:i+31]
            then (vD)[i:i+31] ← 0xFFFF_FFFF
            else (vD)[i:i+31] ← 0x0000_0000
    end
    if Rc=1 then do
        t ← ((vD) = ¹²⁸1)
        f ← ((vD) = ¹²⁸0)
        CR[24:27] ← t || 0b0 || f || 0b0
    end

Each single-precision floating-point word element in vA is compared to the corresponding single-precision floating-point word element in vB. The corresponding word element in vD is set to all 1s if the element in vA is greater than the element in vB, and is cleared to all 0s otherwise.

If Rc = 1, the CR6 field is set according to whether all, some, or none of the element pairs compare greater than.

    CR6 = all_greater_than || 0b0 || none_greater_than || 0b0

Note that if a vA or vB element is a NaN, the corresponding result will be 0x0000_0000.

Other registers altered:
    Condition register (CR6): Affected: Bits 0-3 (if Rc = 1)

Figure 6-32 shows the usage of the vcmpgtfp instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-32. vcmpgtfp: Compare Greater-Than of Four Floating-Point Elements (32-Bit)
vcmpgtsbx
Vector Compare Greater-Than Signed Byte

vcmpgtsb  vD,vA,vB (Rc = 0)                     Form: VXR
vcmpgtsb. vD,vA,vB (Rc = 1)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | Rc (21) | 774 (22-31)

    do i=0 to 127 by 8
        if (vA)[i:i+7] >si (vB)[i:i+7]
            then (vD)[i:i+7] ← ⁸1
            else (vD)[i:i+7] ← ⁸0
    end
    if Rc=1 then do
        t ← ((vD) = ¹²⁸1)
        f ← ((vD) = ¹²⁸0)
        CR[24:27] ← t || 0b0 || f || 0b0
    end

Each element of vcmpgtsb is a byte. Each signed-integer element in vA is compared to the corresponding signed-integer element in vB. The corresponding element in vD is set to all 1s if the element in vA is greater than the element in vB, and is cleared to all 0s otherwise.

If Rc = 1, the CR6 field is set according to whether all, some, or none of the element pairs compare greater than.

    CR6 = all_greater_than || 0b0 || none_greater_than || 0b0

Other registers altered:
    Condition register (CR6): Affected: Bits 0-3 (if Rc = 1)

Figure 6-33 shows the usage of the vcmpgtsb instruction. Each of the sixteen elements in the vectors, vA, vB, and vD, is 8 bits long.

Figure 6-33. vcmpgtsb: Compare Greater-Than of Sixteen Signed Integer Elements (8-Bit)
vcmpgtshx
Vector Compare Greater-Than Signed Half Word

vcmpgtsh  vD,vA,vB (Rc = 0)                     Form: VXR
vcmpgtsh. vD,vA,vB (Rc = 1)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | Rc (21) | 838 (22-31)

    do i=0 to 127 by 16
        if (vA)[i:i+15] >si (vB)[i:i+15]
            then (vD)[i:i+15] ← ¹⁶1
            else (vD)[i:i+15] ← ¹⁶0
    end
    if Rc=1 then do
        t ← ((vD) = ¹²⁸1)
        f ← ((vD) = ¹²⁸0)
        CR[24:27] ← t || 0b0 || f || 0b0
    end

Each element of vcmpgtsh is a half word. Each signed-integer element in vA is compared to the corresponding signed-integer element in vB. The corresponding element in vD is set to all 1s if the element in vA is greater than the element in vB, and is cleared to all 0s otherwise.

If Rc = 1, the CR6 field is set according to whether all, some, or none of the element pairs compare greater than.

    CR6 = all_greater_than || 0b0 || none_greater_than || 0b0

Other registers altered:
    Condition register (CR6): Affected: Bits 0-3 (if Rc = 1)

Figure 6-34 shows the usage of the vcmpgtsh instruction. Each of the eight elements in the vectors, vA, vB, and vD, is 16 bits long.

Figure 6-34. vcmpgtsh: Compare Greater-Than of Eight Signed Integer Elements (16-Bit)
vcmpgtswx
Vector Compare Greater-Than Signed Word

vcmpgtsw  vD,vA,vB (Rc = 0)                     Form: VXR
vcmpgtsw. vD,vA,vB (Rc = 1)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | Rc (21) | 902 (22-31)

    do i=0 to 127 by 32
        if (vA)[i:i+31] >si (vB)[i:i+31]
            then (vD)[i:i+31] ← ³²1
            else (vD)[i:i+31] ← ³²0
    end
    if Rc=1 then do
        t ← ((vD) = ¹²⁸1)
        f ← ((vD) = ¹²⁸0)
        CR[24:27] ← t || 0b0 || f || 0b0
    end

Each element of vcmpgtsw is a word. Each signed-integer element in vA is compared to the corresponding signed-integer element in vB. The corresponding element in vD is set to all 1s if the element in vA is greater than the element in vB, and is cleared to all 0s otherwise.

If Rc = 1, the CR6 field is set according to whether all, some, or none of the element pairs compare greater than.

    CR6 = all_greater_than || 0b0 || none_greater_than || 0b0

Other registers altered:
    Condition register (CR6): Affected: Bits 0-3 (if Rc = 1)

Figure 6-35 shows the usage of the vcmpgtsw instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-35. vcmpgtsw: Compare Greater-Than of Four Signed Integer Elements (32-Bit)
vcmpgtubx
Vector Compare Greater-Than Unsigned Byte

vcmpgtub   vD, vA, vB   (Rc = 0)   Form: VXR
vcmpgtub.  vD, vA, vB   (Rc = 1)

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21: Rc | 22-31: 518

  do i = 0 to 127 by 8
    if (vA)[i:i+7] >ui (vB)[i:i+7]
      then vD[i:i+7] ← 0xFF
      else vD[i:i+7] ← 0x00
  end
  if Rc = 1 then do
    t ← (vD = all 128 bits 1)
    f ← (vD = all 128 bits 0)
    CR[24:27] ← t || 0b0 || f || 0b0
  end

Each element of vcmpgtub is a byte. Each unsigned-integer element in vA is compared to the corresponding unsigned-integer element in vB. The corresponding element in vD is set to all 1s if the element in vA is greater than the element in vB, and is cleared to all 0s otherwise.

If Rc = 1, CR6 is set to all_greater_than || 0b0 || none_greater_than || 0b0.

Other registers altered:
  Condition register (CR6): Affected: bits 0-3 (if Rc = 1)

Figure 6-36 shows the usage of the vcmpgtub instruction. Each of the sixteen elements in the vectors vA, vB, and vD is 8 bits long.

Figure 6-36. vcmpgtub: Compare Greater-Than of Sixteen Unsigned Integer Elements (8-bit)
vcmpgtuhx
Vector Compare Greater-Than Unsigned Half Word

vcmpgtuh   vD, vA, vB   (Rc = 0)   Form: VXR
vcmpgtuh.  vD, vA, vB   (Rc = 1)

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21: Rc | 22-31: 582

  do i = 0 to 127 by 16
    if (vA)[i:i+15] >ui (vB)[i:i+15]
      then vD[i:i+15] ← 0xFFFF
      else vD[i:i+15] ← 0x0000
  end
  if Rc = 1 then do
    t ← (vD = all 128 bits 1)
    f ← (vD = all 128 bits 0)
    CR[24:27] ← t || 0b0 || f || 0b0
  end

Each element of vcmpgtuh is a half word. Each unsigned-integer element in vA is compared to the corresponding unsigned-integer element in vB. The corresponding element in vD is set to all 1s if the element in vA is greater than the element in vB, and is cleared to all 0s otherwise.

If Rc = 1, CR6 is set to all_greater_than || 0b0 || none_greater_than || 0b0.

Other registers altered:
  Condition register (CR6): Affected: bits 0-3 (if Rc = 1)

Figure 6-37 shows the usage of the vcmpgtuh instruction. Each of the eight elements in the vectors vA, vB, and vD is 16 bits long.

Figure 6-37. vcmpgtuh: Compare Greater-Than of Eight Unsigned Integer Elements (16-bit)
vcmpgtuwx
Vector Compare Greater-Than Unsigned Word

vcmpgtuw   vD, vA, vB   (Rc = 0)   Form: VXR
vcmpgtuw.  vD, vA, vB   (Rc = 1)

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21: Rc | 22-31: 646

  do i = 0 to 127 by 32
    if (vA)[i:i+31] >ui (vB)[i:i+31]
      then vD[i:i+31] ← 0xFFFF_FFFF
      else vD[i:i+31] ← 0x0000_0000
  end
  if Rc = 1 then do
    t ← (vD = all 128 bits 1)
    f ← (vD = all 128 bits 0)
    CR[24:27] ← t || 0b0 || f || 0b0
  end

Each element of vcmpgtuw is a word. Each unsigned-integer element in vA is compared to the corresponding unsigned-integer element in vB. The corresponding element in vD is set to all 1s if the element in vA is greater than the element in vB, and is cleared to all 0s otherwise.

If Rc = 1, CR6 is set to all_greater_than || 0b0 || none_greater_than || 0b0.

Other registers altered:
  Condition register (CR6): Affected: bits 0-3 (if Rc = 1)

Figure 6-38 shows the usage of the vcmpgtuw instruction. Each of the four elements in the vectors vA, vB, and vD is 32 bits long.

Figure 6-38. vcmpgtuw: Compare Greater-Than of Four Unsigned Integer Elements (32-bit)
vctsxs
Vector Convert to Signed Fixed-Point Word Saturate

vctsxs  vD, vB, UIMM   Form: VX

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: UIMM | 16-20: vB | 21-31: 970

  do i = 0 to 127 by 32
    if (vB)[i+1:i+8] = 255 | (vB)[i+1:i+8] + UIMM ≤ 254 then
      vD[i:i+31] ← CnvtFP32ToSI32Sat((vB)[i:i+31] *fp 2^UIMM)
    else do
      if (vB)[i] = 0 then vD[i:i+31] ← 0x7FFF_FFFF
      else vD[i:i+31] ← 0x8000_0000
      VSCR[SAT] ← 1
    end
  end

Each single-precision word element in vB is multiplied by 2^UIMM. The product is converted to a signed integer using the rounding mode round toward zero. If the intermediate result is greater than (2^31 - 1) it saturates to (2^31 - 1); if it is less than -2^31 it saturates to -2^31. The signed-integer result is placed into the corresponding word element of vD.

Fixed-point integers used by the vector convert instructions can be interpreted as consisting of 32 - UIMM integer bits followed by UIMM fraction bits. The vector convert to fixed-point word instructions support only the rounding mode round toward zero. A single-precision number can be converted to a fixed-point integer using any of the other three rounding modes by executing the appropriate vector round to floating-point integer instruction before the vector convert to fixed-point word instruction.

Other registers altered:
  Vector status and control register (VSCR): Affected: SAT

Figure 6-39 shows the usage of the vctsxs instruction. Each of the four elements in the vectors vB and vD is 32 bits long.

Figure 6-39. vctsxs: Convert Four Floating-Point Elements to Four Signed Integer Elements (32-bit). The scale factor 2^UIMM comes from the opcode.
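The scale, truncate, and saturate steps for a single element can be sketched in scalar C as follows. This is an illustrative model under stated assumptions: the name `vctsxs_model` is hypothetical, the sticky SAT bit is modeled as an `int` flag, and the NaN result shown is an assumption that is not taken from the text above.

```c
#include <stdint.h>

/* Sketch of one vctsxs element. uimm must be in 0..31.
   Sets *sat to 1 when the result saturates; otherwise leaves it unchanged
   (mirroring a sticky status bit). Hypothetical helper, not a real API. */
int32_t vctsxs_model(float x, unsigned uimm, int *sat)
{
    if (x != x)   /* NaN input: ASSUMPTION, behavior not taken from the text */
        return 0;

    /* A float times a power of two is exact in double arithmetic. */
    double scaled = (double)x * (double)(1u << uimm);

    if (scaled >= 2147483648.0)  { *sat = 1; return INT32_MAX; }
    if (scaled <= -2147483649.0) { *sat = 1; return INT32_MIN; }
    return (int32_t)scaled;   /* C conversion truncates toward zero */
}
```

For example, with UIMM = 2 the value 1.75 becomes 7, which reads as 1.75 in a format with two fraction bits.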
vctuxs
Vector Convert to Unsigned Fixed-Point Word Saturate

vctuxs  vD, vB, UIMM   Form: VX

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: UIMM | 16-20: vB | 21-31: 906

  do i = 0 to 127 by 32
    if (vB)[i+1:i+8] = 255 | (vB)[i+1:i+8] + UIMM ≤ 254 then
      vD[i:i+31] ← CnvtFP32ToUI32Sat((vB)[i:i+31] *fp 2^UIMM)
    else do
      if (vB)[i] = 0 then vD[i:i+31] ← 0xFFFF_FFFF
      else vD[i:i+31] ← 0x0000_0000
      VSCR[SAT] ← 1
    end
  end

Each single-precision floating-point word element in vB is multiplied by 2^UIMM. The product is converted to an unsigned fixed-point integer using the rounding mode round toward zero. If the intermediate result is greater than (2^32 - 1) it saturates to (2^32 - 1), and if it is less than 0 it saturates to 0. The unsigned-integer result is placed into the corresponding word element of vD.

Other registers altered:
  Vector status and control register (VSCR): Affected: SAT

Figure 6-40 shows the usage of the vctuxs instruction. Each of the four elements in the vectors vB and vD is 32 bits long.

Figure 6-40. vctuxs: Convert Four Floating-Point Elements to Four Unsigned Integer Elements (32-bit). The scale factor 2^UIMM comes from the opcode.
vexptefp
Vector 2 Raised to the Exponent Estimate Floating Point

vexptefp  vD, vB   Form: VX

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: 0_0000 | 16-20: vB | 21-31: 394

  do i = 0 to 127 by 32
    x ← (vB)[i:i+31]
    vD[i:i+31] ← 2^x
  end

The single-precision floating-point estimate of 2 raised to the power of each single-precision floating-point element in vB is placed into the corresponding element of vD.

The estimate has a relative error in precision no greater than one part in 16, that is,

  |estimate - 2^x| / 2^x ≤ 1/16

where x is the value of the element in vB. The most significant 12 bits of the estimate's significand are monotonic. Note that the value placed into the element of vD may vary between implementations, and between different executions on the same implementation.

If an operand has an integral value and the resulting value is not 0 or +∞, the result is exact.

Operation with various special values of the element in vB is summarized in Table 6-5 below.

If VSCR[NJ] = 1, every denormalized operand element is truncated to a 0 of the same sign before the operation is carried out, and each denormalized result element truncates to a 0 of the same sign.

Table 6-5. Special Values of the Element in vB

  Value of element in vB | Result
  -∞                     | +0
  -0                     | +1
  +0                     | +1
  +∞                     | +∞
  NaN                    | QNaN
Other registers altered:
  None

Figure 6-41 shows the usage of the vexptefp instruction. Each of the four elements in the vectors vB and vD is 32 bits long.

Figure 6-41. vexptefp: 2 Raised to the Exponent Estimate Floating-Point for Four Floating-Point Elements (32-bit)
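The accuracy bound stated for vexptefp (relative error no greater than one part in 16) can be expressed as a small predicate. The sketch below only illustrates the inequality; the function name `within_one_part_in_16` is an assumption, and the caller is assumed to supply the exact value 2^x, which must be positive and finite.

```c
/* Returns 1 if |estimate - exact| / exact <= 1/16, where exact is the
   infinitely precise 2^x supplied by the caller (positive and finite).
   Hypothetical helper for checking an estimate against the stated bound. */
int within_one_part_in_16(double estimate, double exact)
{
    double diff = estimate - exact;
    if (diff < 0.0)
        diff = -diff;
    return diff <= exact / 16.0;
}
```

For x = 3 the exact value is 8, so any estimate between 7.5 and 8.5 satisfies the bound.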
vlogefp
Vector Log2 Estimate Floating Point

vlogefp  vD, vB   Form: VX

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: 0_0000 | 16-20: vB | 21-31: 458

  do i = 0 to 127 by 32
    x ← (vB)[i:i+31]
    vD[i:i+31] ← log2(x)
  end

The single-precision floating-point estimate of the base 2 logarithm of each single-precision floating-point element in vB is placed into the corresponding element of vD.

The estimate has an absolute error in precision (absolute value of the difference between the estimate and the infinitely precise value) no greater than 2^-5. The estimate has a relative error in precision no greater than one part in 8, as described below:

  |estimate - log2(x)| ≤ 1/32   unless |x - 1| ≤ 1/8

where x is the value of the element in vB. The most significant 12 bits of the estimate's significand are monotonic. Note that the value placed into the element of vD may vary between implementations, and between different executions on the same implementation.

Operation with various special values of the element in vB is summarized below in Table 6-6.

If VSCR[NJ] = 1, every denormalized operand element is truncated to a 0 of the same sign before the operation is carried out, and each denormalized result element truncates to a 0 of the same sign.

Table 6-6. Special Values of the Element in vB

  Value       | Result
  -∞          | QNaN
  less than 0 | QNaN
  ±0          | -∞
  +∞          | +∞
  NaN         | QNaN
Other registers altered:
  None

Figure 6-42 shows the usage of the vlogefp instruction. Each of the four elements in the vectors vB and vD is 32 bits long.

Figure 6-42. vlogefp: Log2 Estimate Floating-Point for Four Floating-Point Elements (32-bit)
vmaddfp
Vector Multiply Add Floating Point

vmaddfp  vD, vA, vC, vB   Form: VA

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21-25: vC | 26-31: 46

  do i = 0 to 127 by 32
    vD[i:i+31] ← RndToNearFP32(((vA)[i:i+31] *fp (vC)[i:i+31]) +fp (vB)[i:i+31])
  end

Each single-precision floating-point word element in vA is multiplied by the corresponding single-precision floating-point word element in vC. The corresponding single-precision floating-point word element in vB is added to the product. The result is rounded to the nearest single-precision floating-point number and placed into the corresponding word element of vD.

Note that a vector multiply floating-point instruction is not provided. The effect of such an instruction can be obtained by using vmaddfp with vB containing the value -0.0 (0x8000_0000) in each of its four single-precision floating-point word elements. (The value must be -0.0, not +0.0, in order to obtain the IEEE-conforming result of -0.0 when the result of the multiplication is -0.)

Other registers altered:
  None

If VSCR[NJ] = 1, every denormalized operand element is truncated to a 0 of the same sign before the operation is carried out, and each denormalized result element truncates to a 0 of the same sign.

Figure 6-43 shows the usage of the vmaddfp instruction. Each of the four elements in the vectors vA, vB, vC, and vD is 32 bits long.

Figure 6-43. vmaddfp: Multiply-Add Four Floating-Point Elements (32-bit)
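The reason the addend must be -0.0 rather than +0.0 follows from IEEE addition of signed zeros: (-0) + (+0) is +0 under round-to-nearest, while (-0) + (-0) is -0. The scalar C sketch below illustrates this. The names `madd_model` and `float_bits` are hypothetical, and the model uses an ordinary multiply followed by an add, so it shows the operand roles and zero-sign behavior rather than the single rounding of the vector instruction.

```c
#include <stdint.h>
#include <string.h>

/* a * c + addend, with a, c, addend playing the roles of vA, vC, vB.
   Illustrative only; compile without fast-math style options. */
float madd_model(float a, float c, float addend)
{
    return a * c + addend;
}

/* Returns the raw IEEE single-precision bit pattern of x. */
uint32_t float_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}
```

Here `float_bits(madd_model(-0.0f, 1.0f, -0.0f))` keeps the sign bit (0x8000_0000), whereas an addend of +0.0 would turn the -0 product into +0.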
vmaxfp
Vector Maximum Floating Point

vmaxfp  vD, vA, vB   Form: VX

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21-31: 1034

  do i = 0 to 127 by 32
    if (vA)[i:i+31] ≥fp (vB)[i:i+31]
      then vD[i:i+31] ← (vA)[i:i+31]
      else vD[i:i+31] ← (vB)[i:i+31]
  end

Each single-precision floating-point word element in vA is compared to the corresponding single-precision floating-point word element in vB. The larger of the two single-precision floating-point values is placed into the corresponding word element of vD.

The maximum of +0 and -0 is +0. The maximum of any value and a NaN is a QNaN.

Other registers altered:
  None

Figure 6-44 shows the usage of the vmaxfp instruction. Each of the four elements in the vectors vA, vB, and vD is 32 bits long.

Figure 6-44. vmaxfp: Maximum of Four Floating-Point Elements (32-bit)
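The two special cases called out for vmaxfp (signed zeros and NaNs) are easy to get wrong with a plain `a > b ? a : b`. A minimal scalar sketch follows; the name `vmaxfp_model` is an assumption, and the model simply propagates a NaN operand without modeling how a signaling NaN is turned into a QNaN.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of one vmaxfp element. Hypothetical helper, not a real API. */
float vmaxfp_model(float a, float b)
{
    if (a != a) return a;   /* a is a NaN: result is a NaN */
    if (b != b) return b;   /* b is a NaN: result is a NaN */
    if (a == b) {
        /* Equal operands differ in bits only for +0 versus -0.
           ANDing the patterns clears the sign if either operand is +0. */
        uint32_t ua, ub;
        memcpy(&ua, &a, sizeof ua);
        memcpy(&ub, &b, sizeof ub);
        ua &= ub;
        memcpy(&a, &ua, sizeof a);
        return a;
    }
    return a > b ? a : b;
}
```

A minimum would follow the same pattern with an OR of the bit patterns for equal operands, so that -0 wins over +0.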
vmaxsb
Vector Maximum Signed Byte

vmaxsb  vD, vA, vB   Form: VX

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21-31: 258

  do i = 0 to 127 by 8
    if (vA)[i:i+7] ≥si (vB)[i:i+7]
      then vD[i:i+7] ← (vA)[i:i+7]
      else vD[i:i+7] ← (vB)[i:i+7]
  end

Each element of vmaxsb is a byte. Each signed-integer element in vA is compared to the corresponding signed-integer element in vB. The larger of the two signed-integer values is placed into the corresponding element of vD.

Other registers altered:
  None

Figure 6-45 shows the usage of the vmaxsb instruction. Each of the sixteen elements in the vectors vA, vB, and vD is 8 bits long.

Figure 6-45. vmaxsb: Maximum of Sixteen Signed Integer Elements (8-bit)
vmaxsh
Vector Maximum Signed Half Word

vmaxsh  vD, vA, vB   Form: VX

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21-31: 322

  do i = 0 to 127 by 16
    if (vA)[i:i+15] ≥si (vB)[i:i+15]
      then vD[i:i+15] ← (vA)[i:i+15]
      else vD[i:i+15] ← (vB)[i:i+15]
  end

Each element of vmaxsh is a half word. Each signed-integer element in vA is compared to the corresponding signed-integer element in vB. The larger of the two signed-integer values is placed into the corresponding element of vD.

Other registers altered:
  None

Figure 6-46 shows the usage of the vmaxsh instruction. Each of the eight elements in the vectors vA, vB, and vD is 16 bits long.

Figure 6-46. vmaxsh: Maximum of Eight Signed Integer Elements (16-bit)
vmaxsw
Vector Maximum Signed Word

vmaxsw  vD, vA, vB   Form: VX

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21-31: 386

  do i = 0 to 127 by 32
    if (vA)[i:i+31] ≥si (vB)[i:i+31]
      then vD[i:i+31] ← (vA)[i:i+31]
      else vD[i:i+31] ← (vB)[i:i+31]
  end

Each element of vmaxsw is a word. Each signed-integer element in vA is compared to the corresponding signed-integer element in vB. The larger of the two signed-integer values is placed into the corresponding element of vD.

Other registers altered:
  None

Figure 6-47 shows the usage of the vmaxsw instruction. Each of the four elements in the vectors vA, vB, and vD is 32 bits long.

Figure 6-47. vmaxsw: Maximum of Four Signed Integer Elements (32-bit)
vmaxub
Vector Maximum Unsigned Byte

vmaxub  vD, vA, vB   Form: VX

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21-31: 2

  do i = 0 to 127 by 8
    if (vA)[i:i+7] ≥ui (vB)[i:i+7]
      then vD[i:i+7] ← (vA)[i:i+7]
      else vD[i:i+7] ← (vB)[i:i+7]
  end

Each element of vmaxub is a byte. Each unsigned-integer element in vA is compared to the corresponding unsigned-integer element in vB. The larger of the two unsigned-integer values is placed into the corresponding element of vD.

Other registers altered:
  None

Figure 6-48 shows the usage of the vmaxub instruction. Each of the sixteen elements in the vectors vA, vB, and vD is 8 bits long.

Figure 6-48. vmaxub: Maximum of Sixteen Unsigned Integer Elements (8-bit)
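The signed and unsigned byte maximums differ only in how the same bit pattern is ordered. The scalar sketch below contrasts the two for one element; the names `vmaxsb_model` and `vmaxub_model` are assumptions chosen for illustration.

```c
#include <stdint.h>

/* One element of a signed byte maximum (hypothetical helper). */
int8_t vmaxsb_model(int8_t a, int8_t b)
{
    return a >= b ? a : b;
}

/* One element of an unsigned byte maximum (hypothetical helper). */
uint8_t vmaxub_model(uint8_t a, uint8_t b)
{
    return a >= b ? a : b;
}
```

The pattern 0xFF is 255 when treated as unsigned and -1 when treated as signed, so against the value 1 the unsigned form selects 0xFF while the signed form selects 1.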
vmaxuh
Vector Maximum Unsigned Half Word

vmaxuh  vD, vA, vB   Form: VX

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21-31: 66

  do i = 0 to 127 by 16
    if (vA)[i:i+15] ≥ui (vB)[i:i+15]
      then vD[i:i+15] ← (vA)[i:i+15]
      else vD[i:i+15] ← (vB)[i:i+15]
  end

Each element of vmaxuh is a half word. Each unsigned-integer element in vA is compared to the corresponding unsigned-integer element in vB. The larger of the two unsigned-integer values is placed into the corresponding element of vD.

Other registers altered:
  None

Figure 6-49 shows the usage of the vmaxuh instruction. Each of the eight elements in the vectors vA, vB, and vD is 16 bits long.

Figure 6-49. vmaxuh: Maximum of Eight Unsigned Integer Elements (16-bit)
vmaxuw
Vector Maximum Unsigned Word

vmaxuw  vD, vA, vB   Form: VX

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21-31: 130

  do i = 0 to 127 by 32
    if (vA)[i:i+31] ≥ui (vB)[i:i+31]
      then vD[i:i+31] ← (vA)[i:i+31]
      else vD[i:i+31] ← (vB)[i:i+31]
  end

Each element of vmaxuw is a word. Each unsigned-integer element in vA is compared to the corresponding unsigned-integer element in vB. The larger of the two unsigned-integer values is placed into the corresponding element of vD.

Other registers altered:
  None

Figure 6-50 shows the usage of the vmaxuw instruction. Each of the four elements in the vectors vA, vB, and vD is 32 bits long.

Figure 6-50. vmaxuw: Maximum of Four Unsigned Integer Elements (32-bit)
vmhaddshs
Vector Multiply High and Add Signed Half Word Saturate

vmhaddshs  vD, vA, vB, vC   Form: VA

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21-25: vC | 26-31: 32

  do i = 0 to 127 by 16
    prod[0:31] ← (vA)[i:i+15] *si (vB)[i:i+15]
    temp[0:16] ← prod[0:16] +int SignExtend((vC)[i:i+15], 17)
    vD[i:i+15] ← SItoSIsat(temp[0:16], 16)
  end

Each signed-integer half-word element in vA is multiplied by the corresponding signed-integer half-word element in vB, producing a 32-bit signed-integer product. Bits 0-16 of the intermediate product are added to the corresponding signed-integer half-word element in vC, which is first sign-extended to 17 bits. The 16-bit saturated result from each of the eight 17-bit sums is placed in register vD.

If the intermediate result is greater than (2^15 - 1) it saturates to (2^15 - 1), and if it is less than -2^15 it saturates to -2^15. The signed-integer result is placed into the corresponding half-word element of vD.

Other registers altered:
  Vector status and control register (VSCR): Affected: SAT

Figure 6-51 shows the usage of the vmhaddshs instruction. Each of the eight elements in the vectors vA, vB, vC, and vD is 16 bits long.

Figure 6-51. vmhaddshs: Multiply-High and Add Eight Signed Integer Elements (16-bit)
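Taking bits 0-16 of the 32-bit product (bit 0 being the most significant) amounts to an arithmetic shift right by 15. A minimal scalar sketch of one vmhaddshs element follows. The names `high17` and `vmhaddshs_model` are hypothetical, and SAT is modeled as a sticky `int` flag.

```c
#include <stdint.h>

/* Bits 0-16 of a 32-bit value, i.e. floor(p / 2^15), written without
   relying on the implementation-defined right shift of negative values. */
static int32_t high17(int32_t p)
{
    return p >= 0 ? (p >> 15) : -((-p + 32767) >> 15);
}

/* Sketch of one vmhaddshs element; sets *sat to 1 on saturation. */
int16_t vmhaddshs_model(int16_t a, int16_t b, int16_t c, int *sat)
{
    int32_t prod = (int32_t)a * (int32_t)b;
    int32_t temp = high17(prod) + (int32_t)c;   /* 17-bit sum */
    if (temp > INT16_MAX) { *sat = 1; return INT16_MAX; }
    if (temp < INT16_MIN) { *sat = 1; return INT16_MIN; }
    return (int16_t)temp;
}
```

Read as Q15 fixed point, this is a fractional multiply followed by an add: 0.5 * 0.5 gives 16384 * 16384 >> 15 = 8192, which is 0.25.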
vmhraddshs
Vector Multiply High Round and Add Signed Half Word Saturate

vmhraddshs  vD, vA, vB, vC   Form: VA

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21-25: vC | 26-31: 33

  do i = 0 to 127 by 16
    prod[0:31] ← (vA)[i:i+15] *si (vB)[i:i+15]
    prod[0:31] ← prod[0:31] +int 0x0000_4000
    temp[0:16] ← prod[0:16] +int SignExtend((vC)[i:i+15], 17)
    vD[i:i+15] ← SItoSIsat(temp[0:16], 16)
  end

Each signed-integer half-word element in register vA is multiplied by the corresponding signed-integer half-word element in register vB, producing a 32-bit signed-integer product. The value 0x0000_4000 is added to the product, producing a 32-bit signed-integer sum. Bits 0-16 of the sum are added to the corresponding signed-integer half-word element in register vC.

If the intermediate result is greater than (2^15 - 1) it saturates to (2^15 - 1), and if it is less than -2^15 it saturates to -2^15. The signed-integer result is placed into the corresponding half-word element of register vD.

Figure 6-52 shows the usage of the vmhraddshs instruction. Each of the eight elements in the vectors vA, vB, vC, and vD is 16 bits long.

Figure 6-52. vmhraddshs: Multiply-High Round and Add Eight Signed Integer Elements (16-bit)
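The rounding constant 0x0000_4000 is half of 2^15, so adding it before taking bits 0-16 rounds the product to the nearest multiple of 2^15 instead of truncating it. The scalar sketch below models one element; `high17_of` and `vmhraddshs_model` are hypothetical names, and SAT is modeled as a sticky `int` flag.

```c
#include <stdint.h>

/* floor(p / 2^15), avoiding a right shift of a negative value. */
static int32_t high17_of(int32_t p)
{
    return p >= 0 ? (p >> 15) : -((-p + 32767) >> 15);
}

/* Sketch of one vmhraddshs element; sets *sat to 1 on saturation. */
int16_t vmhraddshs_model(int16_t a, int16_t b, int16_t c, int *sat)
{
    int32_t prod = (int32_t)a * (int32_t)b;     /* at most 2^30 in magnitude */
    prod += 0x00004000;                         /* rounding constant */
    int32_t temp = high17_of(prod) + (int32_t)c;
    if (temp > INT16_MAX) { *sat = 1; return INT16_MAX; }
    if (temp < INT16_MIN) { *sat = 1; return INT16_MIN; }
    return (int16_t)temp;
}
```

For a = 1 and b = 16384 the product is exactly half of 2^15, so this form yields 1 where a truncating multiply-high would yield 0.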
vminfp
Vector Minimum Floating Point

vminfp  vD, vA, vB   Form: VX

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21-31: 1098

  do i = 0 to 127 by 32
    if (vA)[i:i+31] <fp (vB)[i:i+31]
      then vD[i:i+31] ← (vA)[i:i+31]
      else vD[i:i+31] ← (vB)[i:i+31]
  end

Each single-precision floating-point word element in register vA is compared to the corresponding single-precision floating-point word element in register vB. The smaller of the two single-precision floating-point values is placed into the corresponding word element of register vD.

The minimum of +0.0 and -0.0 is -0.0. The minimum of any value and a NaN is a QNaN.

If VSCR[NJ] = 1, every denormalized operand element is truncated to 0 before the comparison is made.

Figure 6-53 shows the usage of the vminfp instruction. Each of the four elements in the vectors vA, vB, and vD is 32 bits long.

Figure 6-53. vminfp: Minimum of Four Floating-Point Elements (32-bit)
vminsb
Vector Minimum Signed Byte

vminsb  vD, vA, vB   Form: VX

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21-31: 770

  do i = 0 to 127 by 8
    if (vA)[i:i+7] <si (vB)[i:i+7]
      then vD[i:i+7] ← (vA)[i:i+7]
      else vD[i:i+7] ← (vB)[i:i+7]
  end

Each element of vminsb is a byte. Each signed-integer element in vA is compared to the corresponding signed-integer element in vB. The smaller of the two signed-integer values is placed into the corresponding element of vD.

Other registers altered:
  None

Figure 6-54 shows the usage of the vminsb instruction. Each of the sixteen elements in the vectors vA, vB, and vD is 8 bits long.

Figure 6-54. vminsb: Minimum of Sixteen Signed Integer Elements (8-bit)
vminsh
Vector Minimum Signed Half Word

vminsh  vD, vA, vB   Form: VX

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21-31: 834

  do i = 0 to 127 by 16
    if (vA)[i:i+15] <si (vB)[i:i+15]
      then vD[i:i+15] ← (vA)[i:i+15]
      else vD[i:i+15] ← (vB)[i:i+15]
  end

Each element of vminsh is a half word. Each signed-integer element in vA is compared to the corresponding signed-integer element in vB. The smaller of the two signed-integer values is placed into the corresponding element of vD.

Other registers altered:
  None

Figure 6-55 shows the usage of the vminsh instruction. Each of the eight elements in the vectors vA, vB, and vD is 16 bits long.

Figure 6-55. vminsh: Minimum of Eight Signed Integer Elements (16-bit)
vminsw
Vector Minimum Signed Word

vminsw  vD, vA, vB   Form: VX

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21-31: 898

  do i = 0 to 127 by 32
    if (vA)[i:i+31] <si (vB)[i:i+31]
      then vD[i:i+31] ← (vA)[i:i+31]
      else vD[i:i+31] ← (vB)[i:i+31]
  end

Each element of vminsw is a word. Each signed-integer element in vA is compared to the corresponding signed-integer element in vB. The smaller of the two signed-integer values is placed into the corresponding element of vD.

Other registers altered:
  None

Figure 6-56 shows the usage of the vminsw instruction. Each of the four elements in the vectors vA, vB, and vD is 32 bits long.

Figure 6-56. vminsw: Minimum of Four Signed Integer Elements (32-bit)
vminub
Vector Minimum Unsigned Byte

vminub  vD, vA, vB   Form: VX

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21-31: 514

  do i = 0 to 127 by 8
    if (vA)[i:i+7] <ui (vB)[i:i+7]
      then vD[i:i+7] ← (vA)[i:i+7]
      else vD[i:i+7] ← (vB)[i:i+7]
  end

Each element of vminub is a byte. Each unsigned-integer element in vA is compared to the corresponding unsigned-integer element in vB. The smaller of the two unsigned-integer values is placed into the corresponding element of vD.

Other registers altered:
  None

Figure 6-57 shows the usage of the vminub instruction. Each of the sixteen elements in the vectors vA, vB, and vD is 8 bits long.

Figure 6-57. vminub: Minimum of Sixteen Unsigned Integer Elements (8-bit)
vminuh
Vector Minimum Unsigned Half Word

vminuh  vD, vA, vB   Form: VX

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21-31: 578

  do i = 0 to 127 by 16
    if (vA)[i:i+15] <ui (vB)[i:i+15]
      then vD[i:i+15] ← (vA)[i:i+15]
      else vD[i:i+15] ← (vB)[i:i+15]
  end

Each element of vminuh is a half word. Each unsigned-integer element in vA is compared to the corresponding unsigned-integer element in vB. The smaller of the two unsigned-integer values is placed into the corresponding element of vD.

Other registers altered:
  None

Figure 6-58 shows the usage of the vminuh instruction. Each of the eight elements in the vectors vA, vB, and vD is 16 bits long.

Figure 6-58. vminuh: Minimum of Eight Unsigned Integer Elements (16-bit)
vminuw
Vector Minimum Unsigned Word

vminuw  vD, vA, vB   Form: VX

Encoding (bit positions):
  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21-31: 642

  do i = 0 to 127 by 32
    if (vA)[i:i+31] <ui (vB)[i:i+31]
      then vD[i:i+31] ← (vA)[i:i+31]
      else vD[i:i+31] ← (vB)[i:i+31]
  end

Each element of vminuw is a word. Each unsigned-integer element in vA is compared to the corresponding unsigned-integer element in vB. The smaller of the two unsigned-integer values is placed into the corresponding element of vD.

Other registers altered:
  None

Figure 6-59 shows the usage of the vminuw instruction. Each of the four elements in the vectors vA, vB, and vD is 32 bits long.

Figure 6-59. vminuw: Minimum of Four Unsigned Integer Elements (32-bit)
vmladduhm
Vector multiply low and add unsigned half word modulo

vmladduhm vD,vA,vB,vC    Form: VA

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:25 | 26:31): 04 | vD | vA | vB | vC | 34

do i=0 to 127 by 16
    prod[0:31] ← (vA)[i:i+15] *ui (vB)[i:i+15]
    temp[0:31] ← prod[0:31] +int (vC)[i:i+15]
    vD[i:i+15] ← temp[16:31]
end

Each integer half-word element in vA is multiplied by the corresponding integer half-word element in vB, producing a 32-bit integer product. The product is added to the corresponding integer half-word element in vC. The low-order 16 bits of the integer result are placed into the corresponding half-word element of vD.

Note that vmladduhm can be used for unsigned or signed integers, because the low-order 16 bits of the result are the same in both interpretations.

Other registers altered: none

Figure 6-60 shows the usage of the vmladduhm instruction. Each of the eight elements in the vectors vA, vB, vC, and vD is 16 bits long.

Figure 6-60. vmladduhm: multiply-add of eight integer elements (16-bit)
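The modulo behavior of vmladduhm is easiest to see in a scalar model. The sketch below is an assumption-laden illustration in portable C (the name `vmladduhm_model` and the array layout are invented here); it keeps only the low-order 16 bits of each multiply-add.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of vmladduhm:
   vd[i] = low 16 bits of (va[i] * vb[i] + vc[i]). */
void vmladduhm_model(uint16_t vd[8], const uint16_t va[8],
                     const uint16_t vb[8], const uint16_t vc[8])
{
    assert(vd && va && vb && vc);
    for (int i = 0; i < 8; i++) {
        /* Widen first so the multiply happens in 32-bit unsigned arithmetic. */
        uint32_t prod = (uint32_t)va[i] * (uint32_t)vb[i];
        uint32_t temp = prod + vc[i];
        vd[i] = (uint16_t)temp;   /* modulo 2^16 */
    }
}
```

With va = 0xFFFF, vb = 0xFFFF, and vc = 1, the model returns 0x0002. Read as signed values this is (-1 * -1) + 1 = 2, which illustrates why the same instruction serves both signed and unsigned operands.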
vmrghb
Vector merge high byte

vmrghb vD,vA,vB    Form: VX

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:31): 04 | vD | vA | vB | 12

do i=0 to 63 by 8
    vD[i*2:(i*2)+15] ← (vA)[i:i+7] || (vB)[i:i+7]
end

Each element of vmrghb is a byte. The elements in the high-order half of vA are placed, in the same order, into the even-numbered elements of vD. The elements in the high-order half of vB are placed, in the same order, into the odd-numbered elements of vD.

Other registers altered: none

Figure 6-61 shows the usage of the vmrghb instruction. Each of the sixteen elements in the vectors vA, vB, and vD is 8 bits long.

Figure 6-61. vmrghb: merge eight high-order elements (8-bit)
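The interleaving performed by the merge instructions can be sketched in portable C. The functions below model vmrghb and, for comparison, its low-order counterpart vmrglb; the names, the byte-array representation (index 0 is the leftmost byte), and the temporary buffer are assumptions made for this illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: interleave eight bytes of va and vb starting at
   element `first` (0 for the high-order half, 8 for the low-order half). */
static void merge_bytes(uint8_t vd[16], const uint8_t va[16],
                        const uint8_t vb[16], int first)
{
    uint8_t tmp[16];              /* allows vd to alias va or vb */
    assert(vd && va && vb);
    for (int i = 0; i < 8; i++) {
        tmp[2 * i]     = va[first + i];   /* even-numbered elements from vA */
        tmp[2 * i + 1] = vb[first + i];   /* odd-numbered elements from vB  */
    }
    memcpy(vd, tmp, sizeof tmp);
}

void vmrghb_model(uint8_t vd[16], const uint8_t va[16], const uint8_t vb[16])
{
    merge_bytes(vd, va, vb, 0);
}

void vmrglb_model(uint8_t vd[16], const uint8_t va[16], const uint8_t vb[16])
{
    merge_bytes(vd, va, vb, 8);
}
```

The half-word and word variants follow the same pattern with wider elements and fewer iterations.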
vmrghh
Vector merge high half word

vmrghh vD,vA,vB    Form: VX

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:31): 04 | vD | vA | vB | 76

do i=0 to 63 by 16
    vD[i*2:(i*2)+31] ← (vA)[i:i+15] || (vB)[i:i+15]
end

Each element of vmrghh is a half word. The elements in the high-order half of vA are placed, in the same order, into the even-numbered elements of vD. The elements in the high-order half of vB are placed, in the same order, into the odd-numbered elements of vD.

Other registers altered: none

Figure 6-62 shows the usage of the vmrghh instruction. Each of the eight elements in the vectors vA, vB, and vD is 16 bits long.

Figure 6-62. vmrghh: merge four high-order elements (16-bit)
vmrghw
Vector merge high word

vmrghw vD,vA,vB    Form: VX

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:31): 04 | vD | vA | vB | 140

do i=0 to 63 by 32
    vD[i*2:(i*2)+63] ← (vA)[i:i+31] || (vB)[i:i+31]
end

Each element of vmrghw is a word. The elements in the high-order half of vA are placed, in the same order, into the even-numbered elements of vD. The elements in the high-order half of vB are placed, in the same order, into the odd-numbered elements of vD.

Other registers altered: none

Figure 6-63 shows the usage of the vmrghw instruction. Each of the four elements in the vectors vA, vB, and vD is 32 bits long.

Figure 6-63. vmrghw: merge two high-order elements (32-bit)
vmrglb
Vector merge low byte

vmrglb vD,vA,vB    Form: VX

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:31): 04 | vD | vA | vB | 268

do i=0 to 63 by 8
    vD[i*2:(i*2)+15] ← (vA)[i+64:i+71] || (vB)[i+64:i+71]
end

Each element of vmrglb is a byte. The elements in the low-order half of vA are placed, in the same order, into the even-numbered elements of vD. The elements in the low-order half of vB are placed, in the same order, into the odd-numbered elements of vD.

Other registers altered: none

Figure 6-64 shows the usage of the vmrglb instruction. Each of the sixteen elements in the vectors vA, vB, and vD is 8 bits long.

Figure 6-64. vmrglb: merge eight low-order elements (8-bit)
vmrglh
Vector merge low half word

vmrglh vD,vA,vB    Form: VX

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:31): 04 | vD | vA | vB | 332

do i=0 to 63 by 16
    vD[i*2:(i*2)+31] ← (vA)[i+64:i+79] || (vB)[i+64:i+79]
end

Each element of vmrglh is a half word. The elements in the low-order half of vA are placed, in the same order, into the even-numbered elements of vD. The elements in the low-order half of vB are placed, in the same order, into the odd-numbered elements of vD.

Other registers altered: none

Figure 6-65 shows the usage of the vmrglh instruction. Each of the eight elements in the vectors vA, vB, and vD is 16 bits long.

Figure 6-65. vmrglh: merge four low-order elements (16-bit)
vmrglw
Vector merge low word

vmrglw vD,vA,vB    Form: VX

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:31): 04 | vD | vA | vB | 396

do i=0 to 63 by 32
    vD[i*2:(i*2)+63] ← (vA)[i+64:i+95] || (vB)[i+64:i+95]
end

Each element of vmrglw is a word. The elements in the low-order half of vA are placed, in the same order, into the even-numbered elements of vD. The elements in the low-order half of vB are placed, in the same order, into the odd-numbered elements of vD.

Other registers altered: none

Figure 6-66 shows the usage of the vmrglw instruction. Each of the four elements in the vectors vA, vB, and vD is 32 bits long.

Figure 6-66. vmrglw: merge two low-order elements (32-bit)
vmsummbm
Vector multiply sum mixed-sign byte modulo

vmsummbm vD,vA,vB,vC    Form: VA

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:25 | 26:31): 04 | vD | vA | vB | vC | 37

do i=0 to 127 by 32
    temp[0:31] ← (vC)[i:i+31]
    do j=0 to 31 by 8
        prod[0:15] ← (vA)[i+j:i+j+7] *sui (vB)[i+j:i+j+7]
        temp[0:31] ← temp[0:31] +int SignExtend(prod[0:15],32)
    end
    vD[i:i+31] ← temp[0:31]
end

For each word element in vC the following operations are performed in the order shown. Each of the four signed-integer byte elements contained in the corresponding word element of vA is multiplied by the corresponding unsigned-integer byte element in vB, producing a signed-integer 16-bit product. The signed-integer modulo sum of these four products is added to the signed-integer word element in vC. The signed-integer result is placed into the corresponding word element of vD.

Other registers altered: none

Figure 6-67 shows the usage of the vmsummbm instruction. Each of the sixteen elements in the vectors vA and vB is 8 bits long. Each of the four elements in the vectors vC and vD is 32 bits long.

Figure 6-67. vmsummbm: multiply-sum of integer elements (8-bit to 32-bit)
vmsumshm
Vector multiply sum signed half word modulo

vmsumshm vD,vA,vB,vC    Form: VA

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:25 | 26:31): 04 | vD | vA | vB | vC | 40

do i=0 to 127 by 32
    temp[0:31] ← (vC)[i:i+31]
    do j=0 to 31 by 16
        prod[0:31] ← (vA)[i+j:i+j+15] *si (vB)[i+j:i+j+15]
        temp[0:31] ← temp[0:31] +int prod[0:31]
    end
    vD[i:i+31] ← temp[0:31]
end

For each word element in vC the following operations are performed in the order shown. Each of the two signed-integer half-word elements contained in the corresponding word element of vA is multiplied by the corresponding signed-integer half-word element in vB, producing a signed-integer 32-bit product. The signed-integer modulo sum of these two products is added to the signed-integer word element in vC. The signed-integer result is placed into the corresponding word element of vD.

Other registers altered: none

Figure 6-68 shows the usage of the vmsumshm instruction. Each of the eight elements in the vectors vA and vB is 16 bits long. Each of the four elements in the vectors vC and vD is 32 bits long.

Figure 6-68. vmsumshm: multiply-sum of signed integer elements (16-bit to 32-bit)
vmsumshs
Vector multiply sum signed half word saturate

vmsumshs vD,vA,vB,vC    Form: VA

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:25 | 26:31): 04 | vD | vA | vB | vC | 41

do i=0 to 127 by 32
    temp[0:33] ← SignExtend((vC)[i:i+31],34)
    do j=0 to 31 by 16
        prod[0:31] ← (vA)[i+j:i+j+15] *si (vB)[i+j:i+j+15]
        temp[0:33] ← temp[0:33] +int SignExtend(prod[0:31],34)
    end
    vD[i:i+31] ← SItoSIsat(temp[0:33],32)
end

For each word element in vC the following operations are performed in the order shown. Each of the two signed-integer half-word elements in the corresponding word element of vA is multiplied by the corresponding signed-integer half-word element in vB, producing a signed-integer 32-bit product. The signed-integer sum of these two products is added to the signed-integer word element in vC. If this intermediate result is greater than (2^31 - 1) it saturates to (2^31 - 1), and if it is less than -2^31 it saturates to -2^31. The signed-integer result is placed into the corresponding word element of vD.

Other registers altered: SAT

Figure 6-69 shows the usage of the vmsumshs instruction. Each of the eight elements in the vectors vA and vB is 16 bits long. Each of the four elements in the vectors vC and vD is 32 bits long.

Figure 6-69. vmsumshs: multiply-sum of signed integer elements (16-bit to 32-bit)
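The difference between the modulo and saturating multiply-sum forms comes down to how the wide intermediate is narrowed. The following portable C sketch models vmsumshs under some assumptions of its own: the names `sat_s32` and `vmsumshs_model` are hypothetical, a 64-bit accumulator stands in for the 34-bit temp, and the SAT bit is represented by a sticky `int` flag supplied by the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a wide intermediate to the signed 32-bit range.
   *sat is only ever set, never cleared (sticky, like SAT). */
static int32_t sat_s32(int64_t x, int *sat)
{
    if (x > INT32_MAX) { *sat = 1; return INT32_MAX; }
    if (x < INT32_MIN) { *sat = 1; return INT32_MIN; }
    return (int32_t)x;
}

/* Hypothetical scalar model of vmsumshs. */
void vmsumshs_model(int32_t vd[4], const int16_t va[8], const int16_t vb[8],
                    const int32_t vc[4], int *sat)
{
    assert(vd && va && vb && vc && sat);
    for (int w = 0; w < 4; w++) {
        int64_t temp = vc[w];                    /* sign-extended vC word */
        for (int j = 0; j < 2; j++) {
            /* Two half-word products per word element. */
            temp += (int64_t)va[2 * w + j] * vb[2 * w + j];
        }
        vd[w] = sat_s32(temp, sat);              /* saturate once, at the end */
    }
}
```

Saturation is applied once to the full sum rather than after each addition, matching the single SItoSIsat step in the pseudocode.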
vmsumubm
Vector multiply sum unsigned byte modulo

vmsumubm vD,vA,vB,vC    Form: VA

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:25 | 26:31): 04 | vD | vA | vB | vC | 36

do i=0 to 127 by 32
    temp[0:31] ← (vC)[i:i+31]
    do j=0 to 31 by 8
        prod[0:15] ← (vA)[i+j:i+j+7] *ui (vB)[i+j:i+j+7]
        temp[0:31] ← temp[0:31] +int ZeroExtend(prod[0:15],32)
    end
    vD[i:i+31] ← temp[0:31]
end

For each word element in vC the following operations are performed in the order shown. Each of the four unsigned-integer byte elements contained in the corresponding word element of vA is multiplied by the corresponding unsigned-integer byte element in vB, producing an unsigned-integer 16-bit product. The unsigned-integer modulo sum of these four products is added to the unsigned-integer word element in vC. The unsigned-integer result is placed into the corresponding word element of vD.

Other registers altered: none

Figure 6-70 shows the usage of the vmsumubm instruction. Each of the sixteen elements in the vectors vA and vB is 8 bits long. Each of the four elements in the vectors vC and vD is 32 bits long.

Figure 6-70. vmsumubm: multiply-sum of unsigned integer elements (8-bit to 32-bit)
vmsumuhm
Vector multiply sum unsigned half word modulo

vmsumuhm vD,vA,vB,vC    Form: VA

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:25 | 26:31): 04 | vD | vA | vB | vC | 38

do i=0 to 127 by 32
    temp[0:31] ← (vC)[i:i+31]
    do j=0 to 31 by 16
        prod[0:31] ← (vA)[i+j:i+j+15] *ui (vB)[i+j:i+j+15]
        temp[0:31] ← temp[0:31] +int prod[0:31]
    end
    vD[i:i+31] ← temp[0:31]
end

For each word element in vC the following operations are performed in the order shown. Each of the two unsigned-integer half-word elements contained in the corresponding word element of vA is multiplied by the corresponding unsigned-integer half-word element in vB, producing an unsigned-integer 32-bit product. The unsigned-integer modulo sum of these two products is added to the unsigned-integer word element in vC. The unsigned-integer result is placed into the corresponding word element of vD.

Other registers altered: none

Figure 6-71 shows the usage of the vmsumuhm instruction. Each of the eight elements in the vectors vA and vB is 16 bits long. Each of the four elements in the vectors vC and vD is 32 bits long.

Figure 6-71. vmsumuhm: multiply-sum of unsigned integer elements (16-bit to 32-bit)
vmsumuhs
Vector multiply sum unsigned half word saturate

vmsumuhs vD,vA,vB,vC    Form: VA

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:25 | 26:31): 04 | vD | vA | vB | vC | 39

do i=0 to 127 by 32
    temp[0:33] ← ZeroExtend((vC)[i:i+31],34)
    do j=0 to 31 by 16
        prod[0:31] ← (vA)[i+j:i+j+15] *ui (vB)[i+j:i+j+15]
        temp[0:33] ← temp[0:33] +int ZeroExtend(prod[0:31],34)
    end
    vD[i:i+31] ← UItoUIsat(temp[0:33],32)
end

For each word element in vC the following operations are performed in the order shown. Each of the two unsigned-integer half-word elements contained in the corresponding word element of vA is multiplied by the corresponding unsigned-integer half-word element in vB, producing an unsigned-integer 32-bit product. The unsigned-integer sum of these two products is saturate-added to the unsigned-integer word element in vC. The unsigned-integer result is placed into the corresponding word element of vD.

Other registers altered: SAT

Figure 6-72 shows the usage of the vmsumuhs instruction. Each of the eight elements in the vectors vA and vB is 16 bits long. Each of the four elements in the vectors vC and vD is 32 bits long.

Figure 6-72. vmsumuhs: multiply-sum of unsigned integer elements (16-bit to 32-bit)
vmulesb
Vector multiply even signed byte

vmulesb vD,vA,vB    Form: VX

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:31): 04 | vD | vA | vB | 776

do i=0 to 127 by 16
    prod[0:15] ← (vA)[i:i+7] *si (vB)[i:i+7]
    vD[i:i+15] ← prod[0:15]
end

Each even-numbered signed-integer byte element in vA is multiplied by the corresponding signed-integer byte element in vB. The eight 16-bit signed-integer products are placed, in the same order, into the eight half words of vD.

Other registers altered: none

Figure 6-73 shows the usage of the vmulesb instruction. Each of the sixteen elements in the vectors vA and vB is 8 bits long. Each of the eight elements in the vector vD is 16 bits long.

Figure 6-73. vmulesb: even multiply of eight signed integer elements (8-bit)
vmulesh
Vector multiply even signed half word

vmulesh vD,vA,vB    Form: VX

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:31): 04 | vD | vA | vB | 840

do i=0 to 127 by 32
    prod[0:31] ← (vA)[i:i+15] *si (vB)[i:i+15]
    vD[i:i+31] ← prod[0:31]
end

Each even-numbered signed-integer half-word element in vA is multiplied by the corresponding signed-integer half-word element in vB. The four 32-bit signed-integer products are placed, in the same order, into the four words of vD.

Other registers altered: none

Figure 6-74 shows the usage of the vmulesh instruction. Each of the eight elements in the vectors vA and vB is 16 bits long. Each of the four elements in the vector vD is 32 bits long.

Figure 6-74. vmulesh: even multiply of four signed integer elements (16-bit)
vmuleub
Vector multiply even unsigned byte

vmuleub vD,vA,vB    Form: VX

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:31): 04 | vD | vA | vB | 520

do i=0 to 127 by 16
    prod[0:15] ← (vA)[i:i+7] *ui (vB)[i:i+7]
    vD[i:i+15] ← prod[0:15]
end

Each even-numbered unsigned-integer byte element in register vA is multiplied by the corresponding unsigned-integer byte element in register vB. The eight 16-bit unsigned-integer products are placed, in the same order, into the eight half words of register vD.

Other registers altered: none

Figure 6-75 shows the usage of the vmuleub instruction. Each of the sixteen elements in the vectors vA and vB is 8 bits long. Each of the eight elements in the vector vD is 16 bits long.

Figure 6-75. vmuleub: even multiply of eight unsigned integer elements (8-bit)
vmuleuh
Vector multiply even unsigned half word

vmuleuh vD,vA,vB    Form: VX

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:31): 04 | vD | vA | vB | 584

do i=0 to 127 by 32
    prod[0:31] ← (vA)[i:i+15] *ui (vB)[i:i+15]
    vD[i:i+31] ← prod[0:31]
end

Each even-numbered unsigned-integer half-word element in register vA is multiplied by the corresponding unsigned-integer half-word element in register vB. The four 32-bit unsigned-integer products are placed, in the same order, into the four words of register vD.

Other registers altered: none

Figure 6-76 shows the usage of the vmuleuh instruction. Each of the eight elements in the vectors vA and vB is 16 bits long. Each of the four elements in the vector vD is 32 bits long.

Figure 6-76. vmuleuh: even multiply of four unsigned integer elements (16-bit)
vmulosb
Vector multiply odd signed byte

vmulosb vD,vA,vB    Form: VX

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:31): 04 | vD | vA | vB | 264

do i=0 to 127 by 16
    prod[0:15] ← (vA)[i+8:i+15] *si (vB)[i+8:i+15]
    vD[i:i+15] ← prod[0:15]
end

Each odd-numbered signed-integer byte element in vA is multiplied by the corresponding signed-integer byte element in vB. The eight 16-bit signed-integer products are placed, in the same order, into the eight half words of vD.

Other registers altered: none

Figure 6-77 shows the usage of the vmulosb instruction. Each of the sixteen elements in the vectors vA and vB is 8 bits long. Each of the eight elements in the vector vD is 16 bits long.

Figure 6-77. vmulosb: odd multiply of eight signed integer elements (8-bit)
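The even and odd multiply forms differ only in which source elements they select. A minimal sketch in portable C, assuming hypothetical names (`vmulesb_model`, `vmulosb_model`) and an array layout in which index 0 is the leftmost byte, makes the pairing explicit:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: multiply bytes at positions 2*i + offset,
   where offset 0 selects even elements and offset 1 selects odd ones. */
static void mul_sb(int16_t vd[8], const int8_t va[16],
                   const int8_t vb[16], int offset)
{
    assert(vd && va && vb);
    for (int i = 0; i < 8; i++) {
        /* Largest magnitude is (-128) * (-128) = 16384, which fits in 16 bits. */
        vd[i] = (int16_t)((int)va[2 * i + offset] * (int)vb[2 * i + offset]);
    }
}

void vmulesb_model(int16_t vd[8], const int8_t va[16], const int8_t vb[16])
{
    mul_sb(vd, va, vb, 0);   /* elements 0, 2, 4, ..., 14 */
}

void vmulosb_model(int16_t vd[8], const int8_t va[16], const int8_t vb[16])
{
    mul_sb(vd, va, vb, 1);   /* elements 1, 3, 5, ..., 15 */
}
```

Running both models on the same inputs produces all sixteen full-width products, eight per call, with no overflow.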
vmulosh
Vector multiply odd signed half word

vmulosh vD,vA,vB    Form: VX

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:31): 04 | vD | vA | vB | 328

do i=0 to 127 by 32
    prod[0:31] ← (vA)[i+16:i+31] *si (vB)[i+16:i+31]
    vD[i:i+31] ← prod[0:31]
end

Each odd-numbered signed-integer half-word element in vA is multiplied by the corresponding signed-integer half-word element in vB. The four 32-bit signed-integer products are placed, in the same order, into the four words of vD.

Other registers altered: none

Figure 6-78 shows the usage of the vmulosh instruction. Each of the eight elements in the vectors vA and vB is 16 bits long. Each of the four elements in the vector vD is 32 bits long.

Figure 6-78. vmulosh: odd multiply of four signed integer elements (16-bit)
vmuloub
Vector multiply odd unsigned byte

vmuloub vD,vA,vB    Form: VX

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:31): 04 | vD | vA | vB | 8

do i=0 to 127 by 16
    prod[0:15] ← (vA)[i+8:i+15] *ui (vB)[i+8:i+15]
    vD[i:i+15] ← prod[0:15]
end

Each odd-numbered unsigned-integer byte element in vA is multiplied by the corresponding unsigned-integer byte element in vB. The eight 16-bit unsigned-integer products are placed, in the same order, into the eight half words of vD.

Other registers altered: none

Figure 6-79 shows the usage of the vmuloub instruction. Each of the sixteen elements in the vectors vA and vB is 8 bits long. Each of the eight elements in the vector vD is 16 bits long.

Figure 6-79. vmuloub: odd multiply of eight unsigned integer elements (8-bit)
vmulouh
Vector multiply odd unsigned half word

vmulouh vD,vA,vB    Form: VX

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:31): 04 | vD | vA | vB | 72

do i=0 to 127 by 32
    prod[0:31] ← (vA)[i+16:i+31] *ui (vB)[i+16:i+31]
    vD[i:i+31] ← prod[0:31]
end

Each odd-numbered unsigned-integer half-word element in vA is multiplied by the corresponding unsigned-integer half-word element in vB. The four 32-bit unsigned-integer products are placed, in the same order, into the four words of vD.

Other registers altered: none

Figure 6-80 shows the usage of the vmulouh instruction. Each of the eight elements in the vectors vA and vB is 16 bits long. Each of the four elements in the vector vD is 32 bits long.

Figure 6-80. vmulouh: odd multiply of four unsigned integer elements (16-bit)
vnmsubfp
Vector negative multiply-subtract floating point

vnmsubfp vD,vA,vC,vB    Form: VA

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:25 | 26:31): 04 | vD | vA | vB | vC | 47

do i=0 to 127 by 32
    vD[i:i+31] ← -RndToNearFP32(((vA)[i:i+31] *fp (vC)[i:i+31]) -fp (vB)[i:i+31])
end

Each single-precision floating-point word element in vA is multiplied by the corresponding single-precision floating-point word element in vC. The corresponding single-precision floating-point word element in vB is subtracted from the product. The sign of the difference is inverted. The result is rounded to the nearest single-precision floating-point number and placed into the corresponding word element of vD.

Note that only one rounding occurs in this operation. Also note that a QNaN result is not negated.

Other registers altered: none

Figure 6-81 shows the usage of the vnmsubfp instruction. Each of the four elements in the vectors vA, vB, vC, and vD is 32 bits long.

Figure 6-81. vnmsubfp: negative multiply-subtract of four floating-point elements (32-bit)
vnor
Vector logical NOR

vnor vD,vA,vB    Form: VX

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:31): 04 | vD | vA | vB | 1284

vD ← ¬((vA) | (vB))

The contents of vA are bitwise ORed with the contents of vB and the complemented result is placed into vD.

Other registers altered: none

Simplified mnemonics:
    vnot vD,vS    equivalent to    vnor vD,vS,vS

Figure 6-82 shows the usage of the vnor instruction.

Figure 6-82. vnor: bitwise NOR of 128-bit vector
vor
Vector logical OR

vor vD,vA,vB    Form: VX

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:31): 04 | vD | vA | vB | 1156

vD ← (vA) | (vB)

The contents of vA are bitwise ORed with the contents of vB and the result is placed into vD.

Other registers altered: none

Simplified mnemonics:
    vmr vD,vS    equivalent to    vor vD,vS,vS

Figure 6-83 shows the usage of the vor instruction.

Figure 6-83. vor: bitwise OR of 128-bit vector
vperm
Vector permute

vperm vD,vA,vB,vC    Form: VA

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:25 | 26:31): 04 | vD | vA | vB | vC | 43

temp[0:255] ← (vA) || (vB)
do i=0 to 127 by 8
    b ← (vC)[i+3:i+7] || 0b000
    vD[i:i+7] ← temp[b:b+7]
end

Let the source vector be the concatenation of the contents of vA followed by the contents of vB. For each integer i in the range 0 to 15, the contents of the byte element in the source vector specified in bits 3:7 of byte element i in vC are placed into byte element i of vD.

Other registers altered: none

Programming note: See the programming notes with the load vector for shift left and load vector for shift right instructions for examples of usage of the vperm instruction.

Figure 6-84 shows the usage of the vperm instruction. Each of the sixteen elements in the vectors vA, vB, vC, and vD is 8 bits long.

Figure 6-84. vperm: concatenate sixteen integer elements (8-bit)
(In the figure, the bytes of vA are numbered 00 to 0F and the bytes of vB 10 to 1F; the example vC holds the hexadecimal indices 01 14 18 10 16 15 19 1A 1C 1C 1C 13 08 1D 1B 0E.)
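The byte selection performed by vperm can be sketched as a table lookup over the 32-byte concatenation of vA and vB. The portable C model below is an illustration only; the name `vperm_model` and the byte-array representation (index 0 is the leftmost byte) are assumptions. Bits 3:7 of each control byte, in the manual's big-endian bit numbering, are its five low-order bits, so the model masks with 0x1F.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of vperm. */
void vperm_model(uint8_t vd[16], const uint8_t va[16],
                 const uint8_t vb[16], const uint8_t vc[16])
{
    uint8_t src[32];   /* source vector: vA followed by vB */
    uint8_t tmp[16];   /* allows vd to alias any input      */
    assert(vd && va && vb && vc);
    memcpy(src, va, 16);
    memcpy(src + 16, vb, 16);
    for (int i = 0; i < 16; i++) {
        /* The upper three bits of the control byte are ignored. */
        tmp[i] = src[vc[i] & 0x1F];
    }
    memcpy(vd, tmp, sizeof tmp);
}
```

Indices 0x00 to 0x0F select from vA and 0x10 to 0x1F select from vB, so a control vector of all 0x1C, for example, broadcasts byte 12 of vB to every element of vD.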
vpkpx
Vector pack pixel32

vpkpx vD,vA,vB    Form: VX

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:31): 04 | vD | vA | vB | 782

do i=0 to 63 by 16
    vD[i]         ← (vA)[(i*2)+7]
    vD[i+1:i+5]   ← (vA)[(i*2)+8:(i*2)+12]
    vD[i+6:i+10]  ← (vA)[(i*2)+16:(i*2)+20]
    vD[i+11:i+15] ← (vA)[(i*2)+24:(i*2)+28]
    vD[i+64]      ← (vB)[(i*2)+7]
    vD[i+65:i+69] ← (vB)[(i*2)+8:(i*2)+12]
    vD[i+70:i+74] ← (vB)[(i*2)+16:(i*2)+20]
    vD[i+75:i+79] ← (vB)[(i*2)+24:(i*2)+28]
end

The source vector is the concatenation of the contents of vA followed by the contents of vB. Each 32-bit word element in the source vector is packed to produce a 16-bit half-word value as described below and placed into the corresponding half-word element of vD. A word is packed to 16 bits by concatenating, in order, the following bits:

    Bit 7 of the first byte (bit 7 of the word)
    Bits 0:4 of the second byte (bits 8:12 of the word)
    Bits 0:4 of the third byte (bits 16:20 of the word)
    Bits 0:4 of the fourth byte (bits 24:28 of the word)

Figure 6-85 shows which bits of the source word are packed to form the half word in the destination register.

Figure 6-85. How a word is packed to a half word
(Source word bits 7, 8:12, 16:20, and 24:28 map to packed half-word bits 0, 1:5, 6:10, and 11:15 of vD.)
Other registers altered: none

Programming note: Each source word can be considered to be a 32-bit pixel consisting of four 8-bit channels. Each target half word can be considered to be a 16-bit pixel consisting of one 1-bit channel and three 5-bit channels. A channel can be used to specify the intensity of a particular color, such as red, green, or blue, or to provide other information needed by the application.

Figure 6-86 shows the usage of the vpkpx instruction. Each of the four elements in the vectors vA and vB is 32 bits long. Each of the eight elements in the vector vD is 16 bits long.

Figure 6-86. vpkpx: pack eight elements (32-bit) to eight elements (16-bit)
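The bit selection that vpkpx applies to each word can be sketched with shifts and masks. The portable C below is an illustration under stated assumptions: the names `pack_pixel` and `vpkpx_model` are hypothetical, each source word is held as a `uint32_t` whose most significant bit is the manual's bit 0, and the output half words are ordered with the four vA results first.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack one 32-bit pixel into 1 + 5 + 5 + 5 bits. */
uint16_t pack_pixel(uint32_t w)
{
    uint32_t c0 = (w >> 24) & 0x01;   /* bit 7 of the word (LSB of byte 0) */
    uint32_t c1 = (w >> 19) & 0x1F;   /* bits 8:12  (top 5 bits of byte 1) */
    uint32_t c2 = (w >> 11) & 0x1F;   /* bits 16:20 (top 5 bits of byte 2) */
    uint32_t c3 = (w >> 3)  & 0x1F;   /* bits 24:28 (top 5 bits of byte 3) */
    return (uint16_t)((c0 << 15) | (c1 << 10) | (c2 << 5) | c3);
}

/* Hypothetical scalar model of vpkpx: vA fills vd[0..3], vB fills vd[4..7]. */
void vpkpx_model(uint16_t vd[8], const uint32_t va[4], const uint32_t vb[4])
{
    assert(vd && va && vb);
    for (int i = 0; i < 4; i++) {
        vd[i]     = pack_pixel(va[i]);
        vd[i + 4] = pack_pixel(vb[i]);
    }
}
```

The three 8-bit channels are truncated rather than rounded: only their five high-order bits survive, and the 1-bit channel comes from the least significant bit of the first byte.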
vpkshss
Vector pack signed half word signed saturate

vpkshss vD,vA,vB    Form: VX

Encoding (bits 0:5 | 6:10 | 11:15 | 16:20 | 21:31): 04 | vD | vA | vB | 398

do i=0 to 63 by 8
    vD[i:i+7]     ← SItoSIsat((vA)[i*2:(i*2)+15],8)
    vD[i+64:i+71] ← SItoSIsat((vB)[i*2:(i*2)+15],8)
end

Let the source vector be the concatenation of the contents of vA followed by the contents of vB. Each signed-integer half-word element in the source vector is converted to an 8-bit signed integer. If the value of the element is greater than (2^7 - 1) the result saturates to (2^7 - 1), and if the value is less than -2^7 the result saturates to -2^7. The result is placed into the corresponding byte element of vD.

Other registers altered: SAT

Figure 6-87 shows the usage of the vpkshss instruction. Each of the eight elements in the vectors vA and vB is 16 bits long. Each of the sixteen elements in the vector vD is 8 bits long.

Figure 6-87. vpkshss: pack sixteen signed integer elements (16-bit) to sixteen signed integer elements (8-bit)
vpkshus
Vector Pack Signed Half Word Unsigned Saturate

vpkshus vD,vA,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 270 (21-31)

    do i=0 to 63 by 8
        vD[i:i+7] ← SItoUIsat((vA)[i*2:(i*2)+15],8)
        vD[i+64:i+71] ← SItoUIsat((vB)[i*2:(i*2)+15],8)
    end

Let the source vector be the concatenation of the contents of vA followed by the contents of vB. Each signed integer half-word element in the source vector is converted to an 8-bit unsigned integer. If the value of the element is greater than (2^8 - 1), the result saturates to (2^8 - 1); if the value is less than 0, the result saturates to 0. The result is placed into the corresponding byte element of vD.

Other registers altered: SAT

Figure 6-88 shows the usage of the vpkshus instruction. Each of the eight elements in the vectors vA and vB is 16 bits long. Each of the sixteen elements in the vector vD is 8 bits long.

Figure 6-88. vpkshus: Pack Sixteen Signed Integer Elements (16-Bit) to Sixteen Unsigned Integer Elements (8-Bit)
vpkswss
Vector Pack Signed Word Signed Saturate

vpkswss vD,vA,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 462 (21-31)

    do i=0 to 63 by 16
        vD[i:i+15] ← SItoSIsat((vA)[i*2:(i*2)+31],16)
        vD[i+64:i+79] ← SItoSIsat((vB)[i*2:(i*2)+31],16)
    end

Let the source vector be the concatenation of the contents of vA followed by the contents of vB. Each signed integer word element in the source vector is converted to a 16-bit signed integer half word. If the value of the element is greater than (2^15 - 1), the result saturates to (2^15 - 1); if the value is less than -2^15, the result saturates to -2^15. The result is placed into the corresponding half-word element of vD.

Other registers altered: SAT

Figure 6-89 shows the usage of the vpkswss instruction. Each of the four elements in the vectors vA and vB is 32 bits long. Each of the eight elements in the vector vD is 16 bits long.

Figure 6-89. vpkswss: Pack Eight Signed Integer Elements (32-Bit) to Eight Signed Integer Elements (16-Bit)
vpkswus
Vector Pack Signed Word Unsigned Saturate

vpkswus vD,vA,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 334 (21-31)

    do i=0 to 63 by 16
        vD[i:i+15] ← SItoUIsat((vA)[i*2:(i*2)+31],16)
        vD[i+64:i+79] ← SItoUIsat((vB)[i*2:(i*2)+31],16)
    end

Let the source vector be the concatenation of the contents of vA followed by the contents of vB. Each signed integer word element in the source vector is converted to a 16-bit unsigned integer. If the value of the element is greater than (2^16 - 1), the result saturates to (2^16 - 1); if the value is less than 0, the result saturates to 0. The result is placed into the corresponding half-word element of vD.

Other registers altered: SAT

Figure 6-90 shows the usage of the vpkswus instruction. Each of the four elements in the vectors vA and vB is 32 bits long. Each of the eight elements in the vector vD is 16 bits long.

Figure 6-90. vpkswus: Pack Eight Signed Integer Elements (32-Bit) to Eight Unsigned Integer Elements (16-Bit)
vpkuhum
Vector Pack Unsigned Half Word Unsigned Modulo

vpkuhum vD,vA,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 14 (21-31)

    do i=0 to 63 by 8
        vD[i:i+7] ← (vA)[(i*2)+8:(i*2)+15]
        vD[i+64:i+71] ← (vB)[(i*2)+8:(i*2)+15]
    end

Let the source vector be the concatenation of the contents of vA followed by the contents of vB. The low-order byte of each half-word element in the source vector is placed into the corresponding byte element of vD.

Other registers altered: None

Figure 6-91 shows the usage of the vpkuhum instruction. Each of the eight elements in the vectors vA and vB is 16 bits long. Each of the sixteen elements in the vector vD is 8 bits long.

Figure 6-91. vpkuhum: Pack Sixteen Unsigned Integer Elements (16-Bit) to Sixteen Unsigned Integer Elements (8-Bit)
vpkuhus
Vector Pack Unsigned Half Word Unsigned Saturate

vpkuhus vD,vA,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 142 (21-31)

    do i=0 to 63 by 8
        vD[i:i+7] ← UItoUIsat((vA)[i*2:(i*2)+15],8)
        vD[i+64:i+71] ← UItoUIsat((vB)[i*2:(i*2)+15],8)
    end

Let the source vector be the concatenation of the contents of vA followed by the contents of vB. Each unsigned integer half-word element in the source vector is converted to an 8-bit unsigned integer. If the value of the element is greater than (2^8 - 1), the result saturates to (2^8 - 1). The result is placed into the corresponding byte element of vD.

Other registers altered: SAT

Figure 6-92 shows the usage of the vpkuhus instruction. Each of the eight elements in the vectors vA and vB is 16 bits long. Each of the sixteen elements in the vector vD is 8 bits long.

Figure 6-92. vpkuhus: Pack Sixteen Unsigned Integer Elements (16-Bit) to Sixteen Unsigned Integer Elements (8-Bit)
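The difference between the modulo form (vpkuhum) and the saturate form (vpkuhus) shows up only when a half-word value does not fit in a byte. A minimal per-element sketch in C, with hypothetical helper names, makes the contrast concrete:

```c
#include <stdint.h>

/* Modulo pack: keep only the low-order byte of the half word. */
static uint8_t pack_u16_modulo(uint16_t v)
{
    return (uint8_t)(v & 0xFFu);
}

/* Saturating pack: values above 2^8 - 1 clamp to 2^8 - 1. */
static uint8_t pack_u16_saturate(uint16_t v)
{
    return (v > 0xFFu) ? (uint8_t)0xFFu : (uint8_t)v;
}
```

For the input 0x1234, the modulo form yields 0x34 and the saturate form yields 0xFF; for any input that already fits in 8 bits, the two agree.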
vpkuwum
Vector Pack Unsigned Word Unsigned Modulo

vpkuwum vD,vA,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 78 (21-31)

    do i=0 to 63 by 16
        vD[i:i+15] ← (vA)[(i*2)+16:(i*2)+31]
        vD[i+64:i+79] ← (vB)[(i*2)+16:(i*2)+31]
    end

Let the source vector be the concatenation of the contents of vA followed by the contents of vB. The low-order half word of each word element in the source vector is placed into the corresponding half-word element of vD.

Other registers altered: None

Figure 6-93 shows the usage of the vpkuwum instruction. Each of the four elements in the vectors vA and vB is 32 bits long. Each of the eight elements in the vector vD is 16 bits long.

Figure 6-93. vpkuwum: Pack Eight Unsigned Integer Elements (32-Bit) to Eight Unsigned Integer Elements (16-Bit)
vpkuwus
Vector Pack Unsigned Word Unsigned Saturate

vpkuwus vD,vA,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 206 (21-31)

    do i=0 to 63 by 16
        vD[i:i+15] ← UItoUIsat((vA)[i*2:(i*2)+31],16)
        vD[i+64:i+79] ← UItoUIsat((vB)[i*2:(i*2)+31],16)
    end

Let the source vector be the concatenation of the contents of vA followed by the contents of vB. Each unsigned integer word element in the source vector is converted to a 16-bit unsigned integer. If the value of the element is greater than (2^16 - 1), the result saturates to (2^16 - 1). The result is placed into the corresponding half-word element of vD.

Other registers altered: SAT

Figure 6-94 shows the usage of the vpkuwus instruction. Each of the four elements in the vectors vA and vB is 32 bits long. Each of the eight elements in the vector vD is 16 bits long.

Figure 6-94. vpkuwus: Pack Eight Unsigned Integer Elements (32-Bit) to Eight Unsigned Integer Elements (16-Bit)
vrefp
Vector Reciprocal Estimate Floating Point

vrefp vD,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | 0_0000 (11-15) | vB (16-20) | 266 (21-31)

    do i=0 to 127 by 32
        x ← (vB)[i:i+31]
        vD[i:i+31] ← 1/x
    end

The single-precision floating-point estimate of the reciprocal of each single-precision floating-point element in vB is placed into the corresponding element of vD.

For results that are not a +0, -0, +∞, -∞, or QNaN, the estimate has a relative error in precision no greater than one part in 4096, that is:

$$\left| \frac{\text{estimate} - 1/x}{1/x} \right| \le \frac{1}{4096}$$

where x is the value of the element in vB. Note that the value placed into the element of vD may vary between implementations, and between different executions on the same implementation.

Operation with various special values of the element in vB is summarized below in Table 6-7.

Table 6-7. Special Values of the Element in vB

    Value    Result
    -∞       -0
    -0       -∞
    +0       +∞
    +∞       +0
    NaN      QNaN

If VSCR[NJ] = 1, every denormalized operand element is truncated to a 0 of the same sign before the operation is carried out, and each denormalized result element truncates to a 0 of the same sign.

Other registers altered: None
Figure 6-95 shows the usage of the vrefp instruction. Each of the four elements in the vectors vB and vD is 32 bits long.

Figure 6-95. vrefp: Reciprocal Estimate of Four Floating-Point Elements (32-Bit)
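The accuracy bound for vrefp can be expressed as a small check in C. This is only a sketch for testing candidate estimates against the one-part-in-4096 bound. The function name is hypothetical, and the check is meaningful only for finite, nonzero x.

```c
#include <stdbool.h>

/* Returns true if est satisfies |est - 1/x| / |1/x| <= 1/4096.
 * Assumes x is finite and nonzero; special values are not handled here. */
static bool recip_estimate_ok(float x, float est)
{
    double exact = 1.0 / (double)x;
    double rel = ((double)est - exact) / exact;
    if (rel < 0.0) {
        rel = -rel; /* absolute value without needing the math library */
    }
    return rel <= 1.0 / 4096.0;
}
```

Because the manual allows the estimate to vary between implementations, portable code should rely only on this bound and not on a particular bit pattern.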
vrfim
Vector Round to Floating-Point Integer toward Minus Infinity

vrfim vD,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | 0_0000 (11-15) | vB (16-20) | 714 (21-31)

    do i=0 to 127 by 32
        vD[i:i+31] ← RndToFPInt32Floor((vB)[i:i+31])
    end

Each single-precision floating-point word element in vB is rounded to a single-precision floating-point integer, using the rounding mode Round toward -Infinity, and placed into the corresponding word element of vD.

Other registers altered: None

Figure 6-96 shows the usage of the vrfim instruction. Each of the four elements in the vectors vB and vD is 32 bits long.

Figure 6-96. vrfim: Round to Minus Infinity of Four Floating-Point Integer Elements (32-Bit)
vrfin
Vector Round to Floating-Point Integer Nearest

vrfin vD,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | 0_0000 (11-15) | vB (16-20) | 522 (21-31)

    do i=0 to 127 by 32
        vD[i:i+31] ← RndToFPInt32Near((vB)[i:i+31])
    end

Each single-precision floating-point word element in vB is rounded to a single-precision floating-point integer, using the rounding mode Round to Nearest, and placed into the corresponding word element of vD.

Note that the result is independent of VSCR[NJ].

Other registers altered: None

Figure 6-97 shows the usage of the vrfin instruction. Each of the four elements in the vectors vB and vD is 32 bits long.

Figure 6-97. vrfin: Nearest Round to Nearest of Four Floating-Point Integer Elements (32-Bit)
vrfip
Vector Round to Floating-Point Integer toward Plus Infinity

vrfip vD,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | 0_0000 (11-15) | vB (16-20) | 650 (21-31)

    do i=0 to 127 by 32
        vD[i:i+31] ← RndToFPInt32Ceil((vB)[i:i+31])
    end

Each single-precision floating-point word element in vB is rounded to a single-precision floating-point integer, using the rounding mode Round toward +Infinity, and placed into the corresponding word element of vD.

If VSCR[NJ] = 1, every denormalized operand element is truncated to 0 before the comparison is made.

Other registers altered: None

Figure 6-98 shows the usage of the vrfip instruction. Each of the four elements in the vectors vB and vD is 32 bits long.

Figure 6-98. vrfip: Round to Plus Infinity of Four Floating-Point Integer Elements (32-Bit)
vrfiz
Vector Round to Floating-Point Integer toward Zero

vrfiz vD,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | 0_0000 (11-15) | vB (16-20) | 586 (21-31)

    do i=0 to 127 by 32
        vD[i:i+31] ← RndToFPInt32Trunc((vB)[i:i+31])
    end

Each single-precision floating-point word element in vB is rounded to a single-precision floating-point integer, using the rounding mode Round toward Zero, and placed into the corresponding word element of vD.

Note that the result is independent of VSCR[NJ].

Other registers altered: None

Figure 6-99 shows the usage of the vrfiz instruction. Each of the four elements in the vectors vB and vD is 32 bits long.

Figure 6-99. vrfiz: Round-to-Zero of Four Floating-Point Integer Elements (32-Bit)
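The four rounding modes used by vrfim, vrfin, vrfip, and vrfiz differ only in how the fractional part is resolved. The C sketch below models them for a single element without relying on the math library. The function names are hypothetical, round to nearest is assumed to break ties toward the even integer, and the sketch ignores the sign of zero results and the handling of denormals and signaling NaNs.

```c
#include <stdint.h>

/* Values with magnitude >= 2^23 are already integral in single precision. */
#define FP32_INTEGRAL_LIMIT 8388608.0f

/* Round toward zero. NaN, infinities, and large values pass through. */
static float rnd_trunc(float x)
{
    if (x != x || x >= FP32_INTEGRAL_LIMIT || x <= -FP32_INTEGRAL_LIMIT) {
        return x;
    }
    return (float)(int32_t)x;
}

/* Round toward -infinity. */
static float rnd_floor(float x)
{
    float t = rnd_trunc(x);
    return (t > x) ? t - 1.0f : t;
}

/* Round toward +infinity. */
static float rnd_ceil(float x)
{
    float t = rnd_trunc(x);
    return (t < x) ? t + 1.0f : t;
}

/* Round to nearest; ties go to the even integer (assumed tie rule). */
static float rnd_near(float x)
{
    if (x != x || x >= FP32_INTEGRAL_LIMIT || x <= -FP32_INTEGRAL_LIMIT) {
        return x;
    }
    float f = rnd_floor(x);
    float d = x - f; /* fractional distance above the floor, in [0, 1) */
    if (d > 0.5f) return f + 1.0f;
    if (d < 0.5f) return f;
    return ((int32_t)f % 2 == 0) ? f : f + 1.0f;
}
```

For the input -1.5, the floor, ceiling, truncate, and nearest results are -2, -1, -1, and -2 respectively.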
vrlb
Vector Rotate Left Integer Byte

vrlb vD,vA,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 4 (21-31)

    do i=0 to 127 by 8
        sh ← (vB)[i+5:i+7]
        vD[i:i+7] ← ROTL((vA)[i:i+7],sh)
    end

Each element is a byte. Each element in vA is rotated left by the number of bits specified in the low-order 3 bits of the corresponding element in vB. The result is placed into the corresponding element of vD.

Other registers altered: None

Figure 6-100 shows the usage of the vrlb instruction. Each of the sixteen elements in the vectors vA, vB, and vD is 8 bits long.

Figure 6-100. vrlb: Left Rotate of Sixteen Integer Elements (8-Bit)
vrlh
Vector Rotate Left Integer Half Word

vrlh vD,vA,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 68 (21-31)

    do i=0 to 127 by 16
        sh ← (vB)[i+12:i+15]
        vD[i:i+15] ← ROTL((vA)[i:i+15],sh)
    end

Each element is a half word. Each element in vA is rotated left by the number of bits specified in the low-order 4 bits of the corresponding element in vB. The result is placed into the corresponding element of vD.

Other registers altered: None

Figure 6-101 shows the usage of the vrlh instruction. Each of the eight elements in the vectors vA, vB, and vD is 16 bits long.

Figure 6-101. vrlh: Left Rotate of Eight Integer Elements (16-Bit)
vrlw
Vector Rotate Left Integer Word

vrlw vD,vA,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 132 (21-31)

    do i=0 to 127 by 32
        sh ← (vB)[i+27:i+31]
        vD[i:i+31] ← ROTL((vA)[i:i+31],sh)
    end

Each element is a word. Each element in vA is rotated left by the number of bits specified in the low-order 5 bits of the corresponding element in vB. The result is placed into the corresponding element of vD.

Other registers altered: None

Figure 6-102 shows the usage of the vrlw instruction. Each of the four elements in the vectors vA, vB, and vD is 32 bits long.

Figure 6-102. vrlw: Left Rotate of Four Integer Elements (32-Bit)
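The rotate instructions use only the low-order bits of each count element, so larger counts wrap around. A per-element sketch in C, with hypothetical helper names, for the word and byte cases:

```c
#include <stdint.h>

/* Rotate a 32-bit element left; only the low-order 5 bits of count are used. */
static uint32_t rotl32(uint32_t v, uint32_t count)
{
    unsigned sh = count & 31u;
    /* Guard sh == 0: shifting a 32-bit value by 32 is undefined in C. */
    return (sh == 0u) ? v : ((v << sh) | (v >> (32u - sh)));
}

/* Rotate an 8-bit element left; only the low-order 3 bits of count are used. */
static uint8_t rotl8(uint8_t v, uint8_t count)
{
    unsigned sh = count & 7u;
    unsigned wide = v; /* work in a wider type, then truncate to 8 bits */
    return (uint8_t)((wide << sh) | (wide >> (8u - sh)));
}
```

A count of 36 on a word element therefore behaves like a count of 4.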
vrsqrtefp
Vector Reciprocal Square Root Estimate Floating Point

vrsqrtefp vD,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | 0_0000 (11-15) | vB (16-20) | 330 (21-31)

    do i=0 to 127 by 32
        x ← (vB)[i:i+31]
        vD[i:i+31] ← 1 ÷fp (√fp(x))
    end

The single-precision estimate of the reciprocal of the square root of each single-precision element in vB is placed into the corresponding word element of vD.

The estimate has a relative error in precision no greater than one part in 4096, as explained below:

$$\left| \frac{\text{estimate} - 1/\sqrt{x}}{1/\sqrt{x}} \right| \le \frac{1}{4096}$$

where x is the value of the element in vB. Note that the value placed into the element of vD may vary between implementations and between different executions on the same implementation. Operation with various special values of the element in vB is summarized below in Table 6-8.

Table 6-8. Special Values of the Element in vB

    Value          Result
    -∞             QNaN
    less than 0    QNaN
    -0             -∞
    +0             +∞
    +∞             +0
    NaN            QNaN

Other registers altered: None

Figure 6-103 shows the usage of the vrsqrtefp instruction. Each of the four elements in the vectors vB and vD is 32 bits long.

Figure 6-103. vrsqrtefp: Reciprocal Square Root Estimate of Four Floating-Point Elements (32-Bit)
vsel
Vector Conditional Select

vsel vD,vA,vB,vC        Form: VA

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | vC (21-25) | 42 (26-31)

    do i=0 to 127
        if (vC)[i]=0 then vD[i] ← (vA)[i]
        else vD[i] ← (vB)[i]
    end

For each bit in vC that contains the value 0, the corresponding bit in vA is placed into the corresponding bit of vD. For each bit in vC that contains the value 1, the corresponding bit in vB is placed into the corresponding bit of vD.

Other registers altered: None

Figure 6-104 shows the usage of the vsel instruction. Each of the vectors vA, vB, vC, and vD is 128 bits long.

Figure 6-104. vsel: Bitwise Conditional Select of Vector Contents (128-Bit)
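The bitwise select reduces to a single Boolean expression per bit. The following C sketch, with a hypothetical function name, applies it to one 32-bit slice of the 128-bit vectors:

```c
#include <stdint.h>

/* For each bit: take the bit from a where c is 0, and from b where c is 1. */
static uint32_t bit_select(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}
```

With a mask produced by a vector compare (all ones or all zeros per element), the same expression acts as a per-element select.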
vsl
Vector Shift Left

vsl vD,vA,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 452 (21-31)

    sh ← (vB)[125:127]
    t ← 1
    do i = 0 to 127 by 8
        t ← t & ((vB)[i+5:i+7] = sh)
        if t = 1 then vD ← (vA) <<ui sh
        else vD ← undefined
    end

The contents of vA are shifted left by the number of bits specified in vB[125-127]. Bits shifted out of bit 0 are lost. Zeros are supplied to the vacated bits on the right. The result is placed into vD.

The contents of the low-order three bits of all byte elements in vB must be identical to vB[125-127]; otherwise the value placed into vD is undefined.

Other registers altered: None

Figure 6-105 shows the usage of the vsl instruction.

Figure 6-105. vsl: Shift Bits Left in Vector (128-Bit)
vslb
Vector Shift Left Integer Byte

vslb vD,vA,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 260 (21-31)

    do i=0 to 127 by 8
        sh ← (vB)[i+5:i+7]
        vD[i:i+7] ← (vA)[i:i+7] <<ui sh
    end

Each element is a byte. Each element in vA is shifted left by the number of bits specified in the low-order 3 bits of the corresponding element in vB. Bits shifted out of bit 0 of the element are lost. Zeros are supplied to the vacated bits on the right. The result is placed into the corresponding element of vD.

Other registers altered: None

Figure 6-106 shows the usage of the vslb instruction. Each of the sixteen elements in the vectors vA, vB, and vD is 8 bits long.

Figure 6-106. vslb: Shift Bits Left in Sixteen Integer Elements (8-Bit)
vsldoi
Vector Shift Left Double by Octet Immediate

vsldoi vD,vA,vB,SHB        Form: VA

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 0 (21) | SH (22-25) | 44 (26-31)

    vD ← ((vA) || (vB)) <<ui (SHB || 0b000)

Let the source vector be the concatenation of the contents of vA followed by the contents of vB. Bytes SHB:SHB+15 of the source vector are placed into vD.

Other registers altered: None

Figure 6-107 shows the usage of the vsldoi instruction. Each of the sixteen elements in the vectors vA, vB, and vD is 8 bits long.

Figure 6-107. vsldoi: Shift Left by Bytes Specified
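The byte selection performed by vsldoi can be modeled with plain arrays in C. In this sketch the function name is hypothetical and byte 0 of each array stands for the most significant byte of the register:

```c
#include <stdint.h>
#include <string.h>

/* Place bytes shb..shb+15 of the 32-byte concatenation a || b into d.
 * shb is masked to 4 bits, matching the width of the immediate field. */
static void shift_left_double_bytes(const uint8_t a[16], const uint8_t b[16],
                                    unsigned shb, uint8_t d[16])
{
    uint8_t src[32];
    memcpy(src, a, 16);      /* bytes 0..15 come from a  */
    memcpy(src + 16, b, 16); /* bytes 16..31 come from b */
    memcpy(d, src + (shb & 15u), 16);
}
```

Because the source is copied into a temporary first, d may alias a or b.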
vslh
Vector Shift Left Integer Half Word

vslh vD,vA,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 324 (21-31)

    do i=0 to 127 by 16
        sh ← (vB)[i+12:i+15]
        vD[i:i+15] ← (vA)[i:i+15] <<ui sh
    end

Each element is a half word. Each element in vA is shifted left by the number of bits specified in the low-order 4 bits of the corresponding element in vB. Bits shifted out of bit 0 of the element are lost. Zeros are supplied to the vacated bits on the right. The result is placed into the corresponding element of vD.

Other registers altered: None

Figure 6-108 shows the usage of the vslh instruction. Each of the eight elements in the vectors vA, vB, and vD is 16 bits long.

Figure 6-108. vslh: Shift Bits Left in Eight Integer Elements (16-Bit)
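Because only the low-order bits of the count are used, an element shift never clears an element entirely: a count equal to the element width behaves like a count of 0. A per-element C sketch for the half-word case, with a hypothetical function name:

```c
#include <stdint.h>

/* Shift a 16-bit element left; only the low-order 4 bits of count are used.
 * Bits shifted out of the element are lost; zeros enter on the right. */
static uint16_t shl16(uint16_t v, uint16_t count)
{
    unsigned sh = count & 15u;
    return (uint16_t)(((uint32_t)v << sh) & 0xFFFFu);
}
```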
vslo
Vector Shift Left by Octet

vslo vD,vA,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1036 (21-31)

    shb ← (vB)[121:124]
    vD ← (vA) <<ui (shb || 0b000)

The contents of vA are shifted left by the number of bytes specified in vB[121-124]. Bytes shifted out of byte 0 are lost. Zeros are supplied to the vacated bytes on the right. The result is placed into vD.

Other registers altered: None

Figure 6-109 shows the usage of the vslo instruction.

Figure 6-109. vslo: Left Byte Shift of Vector (128-Bit)
vslw
Vector Shift Left Integer Word

vslw vD,vA,vB        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 388 (21-31)

    do i=0 to 127 by 32
        sh ← (vB)[i+27:i+31]
        vD[i:i+31] ← (vA)[i:i+31] <<ui sh
    end

Each element is a word. Each element in vA is shifted left by the number of bits specified in the low-order 5 bits of the corresponding element in vB. Bits shifted out of bit 0 of the element are lost. Zeros are supplied to the vacated bits on the right. The result is placed into the corresponding element of vD.

Other registers altered: None

Figure 6-110 shows the usage of the vslw instruction. Each of the four elements in the vectors vA, vB, and vD is 32 bits long.

Figure 6-110. vslw: Shift Bits Left in Four Integer Elements (32-Bit)
vspltb
Vector Splat Byte

vspltb vD,vB,UIMM        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | UIMM (11-15) | vB (16-20) | 524 (21-31)

    b ← UIMM*8
    do i=0 to 127 by 8
        vD[i:i+7] ← (vB)[b:b+7]
    end

Each element of vspltb is a byte. The contents of element UIMM in vB are replicated into each element of vD.

Other registers altered: None

Programming note: The vector splat instructions can be used in preparation for performing arithmetic for which one source vector is to consist of elements that all have the same value (for example, multiplying all elements of a vector register by a constant).

Figure 6-111 shows the usage of the vspltb instruction. Each of the sixteen elements in the vectors vB and vD is 8 bits long.

Figure 6-111. vspltb: Copy Contents to Sixteen Elements (8-Bit)
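A splat simply replicates one selected element across the whole result. The C sketch below models the byte case on arrays. The function name is hypothetical, and masking the index to 4 bits is an assumption about how out-of-range element numbers would be treated.

```c
#include <stdint.h>

/* Replicate element uimm of b into every element of d. */
static void splat_byte(const uint8_t b[16], unsigned uimm, uint8_t d[16])
{
    uint8_t v = b[uimm & 15u]; /* read first, so d may alias b */
    for (int i = 0; i < 16; i++) {
        d[i] = v;
    }
}
```

The splatted vector can then serve as the constant operand of an element-wise multiply or add, as the programming note suggests.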
vsplth
Vector Splat Half Word

vsplth vD,vB,UIMM        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | UIMM (11-15) | vB (16-20) | 588 (21-31)

    b ← UIMM*16
    do i=0 to 127 by 16
        vD[i:i+15] ← (vB)[b:b+15]
    end

Each element of vsplth is a half word. The contents of element UIMM in vB are replicated into each element of vD.

Other registers altered: None

Programming note: The vector splat instructions can be used in preparation for performing arithmetic for which one source vector is to consist of elements that all have the same value (for example, multiplying all elements of a vector register by a constant).

Figure 6-112 shows the usage of the vsplth instruction. Each of the eight elements in the vectors vB and vD is 16 bits long.

Figure 6-112. vsplth: Copy Contents to Eight Elements (16-Bit)
vspltisb
Vector Splat Immediate Signed Byte

vspltisb vD,SIMM        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | SIMM (11-15) | 0000_0 (16-20) | 780 (21-31)

    do i=0 to 127 by 8
        vD[i:i+7] ← SignExtend(SIMM,8)
    end

Each element of vspltisb is a byte. The value of the SIMM field, sign-extended to the length of the element, is replicated into each element of vD.

Other registers altered: None

Figure 6-113 shows the usage of the vspltisb instruction. Each of the sixteen elements in the vector vD is 8 bits long.

Figure 6-113. vspltisb: Copy Value into Sixteen Signed Integer Elements (8-Bit)
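The SIMM field occupies 5 bits of the instruction (bits 11-15), so the splat-immediate instructions can only produce values from -16 to 15. A C sketch of the sign extension for the byte case, with a hypothetical function name:

```c
#include <stdint.h>

/* Sign-extend a 5-bit immediate field to a signed 8-bit element. */
static int8_t sign_extend_simm5(unsigned simm)
{
    simm &= 0x1Fu; /* keep only the 5-bit field */
    /* If the field's sign bit (0x10) is set, the value is negative. */
    return (int8_t)((simm & 0x10u) ? (int)simm - 32 : (int)simm);
}
```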
vspltish
Vector Splat Immediate Signed Half Word

vspltish vD,SIMM        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | SIMM (11-15) | 0000_0 (16-20) | 844 (21-31)

    do i=0 to 127 by 16
        vD[i:i+15] ← SignExtend(SIMM,16)
    end

Each element of vspltish is a half word. The value of the SIMM field, sign-extended to the length of the element, is replicated into each element of vD.

Other registers altered: None

Figure 6-114 shows the usage of the vspltish instruction. Each of the eight elements in the vector vD is 16 bits long.

Figure 6-114. vspltish: Copy Value to Eight Signed Integer Elements (16-Bit)
vspltisw
Vector Splat Immediate Signed Word

vspltisw vD,SIMM        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | SIMM (11-15) | 0000_0 (16-20) | 908 (21-31)

    do i=0 to 127 by 32
        vD[i:i+31] ← SignExtend(SIMM,32)
    end

Each element of vspltisw is a word. The value of the SIMM field, sign-extended to the length of the element, is replicated into each element of vD.

Other registers altered: None

Figure 6-115 shows the usage of the vspltisw instruction. Each of the four elements in the vector vD is 32 bits long.

Figure 6-115. vspltisw: Copy Value to Four Signed Elements (32-Bit)
vspltw
Vector Splat Word

vspltw vD,vB,UIMM        Form: VX

Encoding: 04 (bits 0-5) | vD (6-10) | UIMM (11-15) | vB (16-20) | 652 (21-31)

    b ← UIMM*32
    do i=0 to 127 by 32
        vD[i:i+31] ← (vB)[b:b+31]
    end

Each element of vspltw is a word. The contents of element UIMM in vB are replicated into each element of vD.

Other registers altered: None

Programming note: The vector splat instructions can be used in preparation for performing arithmetic for which one source vector is to consist of elements that all have the same value (for example, multiplying all elements of a vector register by a constant).

Figure 6-116 shows the usage of the vspltw instruction. Each of the four elements in the vectors vB and vD is 32 bits long.

Figure 6-116. vspltw: Copy Contents to Four Elements (32-Bit)
vsr
Vector Shift Right

vsr vD,vA,vB        Form: VX

    sh ← (vB)[125:127]
    t ← 1
    do i = 0 to 127 by 8
        t ← t & ((vB)[i+5:i+7] = sh)
        if t = 1 then vD ← (vA) >>ui sh
        else vD ← undefined
    end

Let sh = vB[125-127]; sh is the shift count in bits (0 ≤ sh ≤ 7). The contents of vA are shifted right by sh bits. Bits shifted out of bit 127 are lost. Zeros are supplied to the vacated bits on the left. The result is placed into vD.

The contents of the low-order three bits of all byte elements in register vB must be identical to vB[125-127]; otherwise the value placed into register vD is undefined.

Other registers altered: None

Programming notes: A pair of vslo and vsl or vsro and vsr instructions, specifying the same shift count register, can be used to shift the contents of a vector register left or right by the number of bits (0-127) specified in the shift count register. The following example shifts the contents of vX left by the number of bits specified in vY and places the result into vZ.

    vslo    vZ,vX,vY
    vsl     vZ,vZ,vY

A double-register shift by a dynamically specified number of bits (0-127) can be performed in six instructions. The following example shifts (vW) || (vX) left by the number of bits specified in vY and places the high-order 128 bits of the result into vZ.

    vslo    t1,vW,vY    #shift high-order reg left
    vsl     t1,t1,vY
    vsububm t3,v0,vY    #adjust shift count ((v0)=0)
    vsro    t2,vX,t3    #shift low-order reg right
    vsr     t2,t2,t3
    vor     vZ,t1,t2    #merge to get final result

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 708 (21-31)
Figure 6-117 shows the usage of the vsr instruction. Each of the sixteen elements in the vectors, vA, vB, and vD, is 8 bits long.

Figure 6-117. vsr: Shift Bits Right for Vectors (128-Bit)
[Figure: vB[125-127] holds the shift count sh (6 in the example); vA is shifted right by sh bits into vD, and sh zeros enter on the left.]
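The vsro and vsr pairing for a right shift of 0 to 127 bits can be sketched in software. The model below treats a vector register as a Python integer whose most significant bit is bit 0, so vB[125:127] is the low three bits of the integer. The helper names are assumptions for illustration, and the shift count register is built by replicating the count into every byte, which is one way (not necessarily the only one) to satisfy the rule that all bytes share the same low-order three bits.

```python
def splat_byte(n):
    """Build a 128-bit value whose sixteen bytes all equal n (0 to 127)."""
    return int.from_bytes(bytes([n]) * 16, "big")


def vsro(va, vb):
    """Shift right by whole bytes; the byte count is vB[121:124]."""
    shb = (vb >> 3) & 0xF
    return va >> (shb * 8)


def vsr(va, vb):
    """Shift right by 0 to 7 bits; the bit count is vB[125:127]."""
    sh = vb & 0x7
    for k in range(16):
        # Every byte of vB must carry the same low-order three bits.
        if (vb >> (8 * k)) & 0x7 != sh:
            raise ValueError("result undefined: byte shift counts differ")
    return va >> sh


def shift_right_128(va, n):
    """Shift a 128-bit value right by n bits using vsro followed by vsr."""
    vy = splat_byte(n)
    return vsr(vsro(va, vy), vy)


x = 0x0123456789ABCDEF0123456789ABCDEF
print(hex(shift_right_128(x, 20)))
```

Because vsro consumes bits 121-124 and vsr consumes bits 125-127 of the same register, the two instructions together cover the full 7-bit shift count.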
vsrab
Vector Shift Right Algebraic Byte

vsrab vD,vA,vB        Form: VX

    do i=0 to 127 by 8
        sh ← (vB)[i+5:i+7]
        vD[i:i+7] ← (vA)[i:i+7] >>si sh
    end

Each element is a byte. Each element in vA is shifted right by the number of bits specified in the low-order 3 bits of the corresponding element in vB. Bits shifted out of bit 7 of the element are lost. Bit 0 of the element is replicated to fill the vacated bits on the left. The result is placed into the corresponding element of vD.

Other registers altered: None

Figure 6-118 shows the usage of the vsrab instruction. Each of the sixteen elements in the vectors vA and vD is 8 bits long.

Figure 6-118. vsrab: Shift Bits Right in Sixteen Integer Elements (8-Bit)
[Figure: each byte of vB supplies a shift count sh (6 in the example); the vacated bits x..x in each element of vD are copies of bit 0 of that element.]

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 772 (21-31)
vsrah
Vector Shift Right Algebraic Half Word

vsrah vD,vA,vB        Form: VX

    do i=0 to 127 by 16
        sh ← (vB)[i+12:i+15]
        vD[i:i+15] ← (vA)[i:i+15] >>si sh
    end

Each element is a half word. Each element in vA is shifted right by the number of bits specified in the low-order 4 bits of the corresponding element in vB. Bits shifted out of bit 15 of the element are lost. Bit 0 of the element is replicated to fill the vacated bits on the left. The result is placed into the corresponding element of vD.

Other registers altered: None

Figure 6-119 shows the usage of the vsrah instruction. Each of the eight elements in the vectors vA and vD is 16 bits long.

Figure 6-119. vsrah: Shift Bits Right for Eight Integer Elements (16-Bit)
[Figure: each half word of vB supplies a shift count sh (6 in the example); the vacated bits x...x in each element of vD are copies of bit 0 of that element.]

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 836 (21-31)
vsraw
Vector Shift Right Algebraic Word

vsraw vD,vA,vB        Form: VX

    do i=0 to 127 by 32
        sh ← (vB)[i+27:i+31]
        vD[i:i+31] ← (vA)[i:i+31] >>si sh
    end

Each element is a word. Each element in vA is shifted right by the number of bits specified in the low-order 5 bits of the corresponding element in vB. Bits shifted out of bit 31 of the element are lost. Bit 0 of the element is replicated to fill the vacated bits on the left. The result is placed into the corresponding element of vD.

Other registers altered: None

Figure 6-120 shows the usage of the vsraw instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-120. vsraw: Shift Bits Right in Four Integer Elements (32-Bit)
[Figure: each word of vB supplies a shift count sh (6 in the example); the vacated bits x...x in each element of vD are copies of bit 0 of that element.]

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 900 (21-31)
vsrb
Vector Shift Right Byte

vsrb vD,vA,vB        Form: VX

    do i=0 to 127 by 8
        sh ← (vB)[i+5:i+7]
        vD[i:i+7] ← (vA)[i:i+7] >>ui sh
    end

Each element is a byte. Each element in vA is shifted right by the number of bits specified in the low-order 3 bits of the corresponding element in vB. Bits shifted out of bit 7 of the element are lost. Zeros are supplied to the vacated bits on the left. The result is placed into the corresponding element of vD.

Other registers altered: None

Figure 6-121 shows the usage of the vsrb instruction. Each of the sixteen elements in the vectors vA and vD is 8 bits long.

Figure 6-121. vsrb: Shift Bits Right in Sixteen Integer Elements (8-Bit)
[Figure: each byte of vB supplies a shift count sh (6 in the example); zeros fill the vacated bits of each element of vD.]

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 516 (21-31)
vsrh
Vector Shift Right Half Word

vsrh vD,vA,vB        Form: VX

    do i=0 to 127 by 16
        sh ← (vB)[i+12:i+15]
        vD[i:i+15] ← (vA)[i:i+15] >>ui sh
    end

Each element is a half word. Each element in vA is shifted right by the number of bits specified in the low-order 4 bits of the corresponding element in vB. Bits shifted out of bit 15 of the element are lost. Zeros are supplied to the vacated bits on the left. The result is placed into the corresponding element of vD.

Other registers altered: None

Figure 6-122 shows the usage of the vsrh instruction. Each of the eight elements in the vectors vA and vD is 16 bits long.

Figure 6-122. vsrh: Shift Bits Right for Eight Integer Elements (16-Bit)
[Figure: each half word of vB supplies a shift count sh (6 in the example); zeros fill the vacated bits of each element of vD.]

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 580 (21-31)
vsro
Vector Shift Right Octet

vsro vD,vA,vB        Form: VX

    shb ← (vB)[121:124]
    vD ← (vA) >>ui (shb || 0b000)

The contents of vA are shifted right by the number of bytes specified in vB[121-124]. Bytes shifted out of vA are lost. Zeros are supplied to the vacated bytes on the left. The result is placed into vD.

Other registers altered: None

Figure 6-123. vsro: Vector Shift Right Octet
[Figure: vB[121-124] holds the byte shift count (5 in the example); the remaining bits of vB are don't care. Zero bytes enter vD on the left.]

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1100 (21-31)
vsrw
Vector Shift Right Word

vsrw vD,vA,vB        Form: VX

    do i=0 to 127 by 32
        sh ← (vB)[i+27:i+31]
        vD[i:i+31] ← (vA)[i:i+31] >>ui sh
    end

Each element is a word. Each element in vA is shifted right by the number of bits specified in the low-order 5 bits of the corresponding element in vB. Bits shifted out of bit 31 of the element are lost. Zeros are supplied to the vacated bits on the left. The result is placed into the corresponding element of vD.

Other registers altered: None

Figure 6-124 shows the usage of the vsrw instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-124. vsrw: Shift Bits Right in Four Integer Elements (32-Bit)
[Figure: each word of vB supplies a shift count sh (6 in the example); zeros fill the vacated bits of each element of vD.]

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 644 (21-31)
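The element-wise right shifts (vsrab, vsrah, vsraw, vsrb, vsrh, and vsrw) differ only in element width and in whether the vacated bits are filled with copies of the sign bit or with zeros. A minimal sketch of that shared behavior follows. The function name and its parameters are assumptions for illustration, and vectors are modeled as lists of unsigned element patterns.

```python
def shift_right_elements(va, vb, bits, algebraic):
    """Shift each element of va right by the low-order log2(bits) bits of vb.

    bits is the element width (8, 16, or 32). With algebraic=True the sign
    bit (bit 0 of the element) fills the vacated bits; otherwise zeros do.
    """
    mask = (1 << bits) - 1
    result = []
    for a, b in zip(va, vb):
        sh = b & (bits - 1)          # only the low-order bits of vB count
        if algebraic and a & (1 << (bits - 1)):
            a -= 1 << bits           # reinterpret the pattern as negative
        result.append((a >> sh) & mask)
    return result


print([hex(e) for e in shift_right_elements([0x80], [1], 8, True)])
print([hex(e) for e in shift_right_elements([0x80], [1], 8, False)])
```

Note that a shift count larger than the element width wraps, because the higher-order bits of each vB element are ignored.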
vsubcuw
Vector Subtract Carryout Unsigned Word

vsubcuw vD,vA,vB        Form: VX

    do i=0 to 127 by 32
        aop[0:32] ← ZeroExtend((vA)[i:i+31],33)
        bop[0:32] ← ZeroExtend((vB)[i:i+31],33)
        temp[0:32] ← aop[0:32] +int ¬bop[0:32] +int 1
        vD[i:i+31] ← ZeroExtend(temp[0],32)
    end

Each unsigned-integer word element in vB is subtracted from the corresponding unsigned-integer word element in vA. The complement of the borrow out of bit 0 of the 32-bit difference is zero-extended to 32 bits and placed into the corresponding word element of vD.

Other registers altered: None

Figure 6-125 shows the usage of the vsubcuw instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-125. vsubcuw: Subtract Carryout of Four Unsigned Integer Elements (32-Bit)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1408 (21-31)
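The "complement of the borrow" is 1 exactly when no borrow occurs, that is, when the vA element is greater than or equal to the vB element. The sketch below computes it as the carry out of a + NOT(b) + 1 on 32-bit operands. The function name and the list-of-words representation are assumptions made for illustration.

```python
def vsubcuw(va, vb):
    """Per word: 1 if a - b produces no borrow (a >= b), otherwise 0."""
    result = []
    for a, b in zip(va, vb):
        # a + NOT(b) + 1 using a 32-bit complement; bit 32 is the carry out.
        total = a + (~b & 0xFFFFFFFF) + 1
        result.append((total >> 32) & 1)
    return result


print(vsubcuw([5, 0, 7, 0xFFFFFFFF], [3, 1, 7, 0]))
```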
vsubfp
Vector Subtract Floating Point

vsubfp vD,vA,vB        Form: VX

    do i=0 to 127 by 32
        vD[i:i+31] ← RndToNearFP32((vA)[i:i+31] -fp (vB)[i:i+31])
    end

Each single-precision floating-point word element in vB is subtracted from the corresponding single-precision floating-point word element in vA. The result is rounded to the nearest single-precision floating-point number and placed into the corresponding word element of vD.

If VSCR[NJ] = 1, every denormalized operand element is truncated to a 0 of the same sign before the operation is carried out, and each denormalized result element truncates to a 0 of the same sign.

Other registers altered: None

Figure 6-126 shows the usage of the vsubfp instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-126. vsubfp: Subtract Four Floating Point Elements (32-Bit)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 74 (21-31)
vsubsbs
Vector Subtract Signed Byte Saturate

vsubsbs vD,vA,vB        Form: VX

    do i=0 to 127 by 8
        aop[0:8] ← SignExtend((vA)[i:i+7],9)
        bop[0:8] ← SignExtend((vB)[i:i+7],9)
        temp[0:8] ← aop[0:8] +int ¬bop[0:8] +int 1
        vD[i:i+7] ← SItoSIsat(temp[0:8],8)
    end

Each element is a byte. Each signed-integer element in vB is subtracted from the corresponding signed-integer element in vA. If the intermediate result is greater than (2^7 - 1) it saturates to (2^7 - 1), and if it is less than -2^7 it saturates to -2^7, where 8 is the length of the element. The signed-integer result is placed into the corresponding element of vD.

Other registers altered: SAT

Figure 6-127 shows the usage of the vsubsbs instruction. Each of the sixteen elements in the vectors, vA, vB, and vD, is 8 bits long.

Figure 6-127. vsubsbs: Subtract Sixteen Signed Integer Elements (8-Bit)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1792 (21-31)
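Signed saturation can be sketched as a clamp applied to the exact difference. The function below is a hypothetical helper, not an API from the manual. It takes elements as signed Python integers, accepts the element width as a parameter so that it also covers the half word and word forms, and returns a flag standing in for the SAT bit.

```python
def subtract_signed_saturate(va, vb, bits):
    """Return (result, sat) for element-wise a - b clamped to a signed range."""
    low = -(1 << (bits - 1))
    high = (1 << (bits - 1)) - 1
    result, sat = [], False
    for a, b in zip(va, vb):
        diff = a - b                       # exact, unbounded difference
        clamped = max(low, min(high, diff))
        sat = sat or clamped != diff       # saturation occurred somewhere
        result.append(clamped)
    return result, sat


print(subtract_signed_saturate([-128, 127, 10], [1, -1, 3], 8))
```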
vsubshs
Vector Subtract Signed Half Word Saturate

vsubshs vD,vA,vB        Form: VX

    do i=0 to 127 by 16
        aop[0:16] ← SignExtend((vA)[i:i+15],17)
        bop[0:16] ← SignExtend((vB)[i:i+15],17)
        temp[0:16] ← aop[0:16] +int ¬bop[0:16] +int 1
        vD[i:i+15] ← SItoSIsat(temp[0:16],16)
    end

Each element is a half word. Each signed-integer element in vB is subtracted from the corresponding signed-integer element in vA. If the intermediate result is greater than (2^15 - 1) it saturates to (2^15 - 1), and if it is less than -2^15 it saturates to -2^15, where 16 is the length of the element. The signed-integer result is placed into the corresponding element of vD.

Other registers altered: SAT

Figure 6-128 shows the usage of the vsubshs instruction. Each of the eight elements in the vectors, vA, vB, and vD, is 16 bits long.

Figure 6-128. vsubshs: Subtract Eight Signed Integer Elements (16-Bit)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1856 (21-31)
vsubsws
Vector Subtract Signed Word Saturate

vsubsws vD,vA,vB        Form: VX

    do i=0 to 127 by 32
        aop[0:32] ← SignExtend((vA)[i:i+31],33)
        bop[0:32] ← SignExtend((vB)[i:i+31],33)
        temp[0:32] ← aop[0:32] +int ¬bop[0:32] +int 1
        vD[i:i+31] ← SItoSIsat(temp[0:32],32)
    end

Each element is a word. Each signed-integer element in vB is subtracted from the corresponding signed-integer element in vA. If the intermediate result is greater than (2^31 - 1) it saturates to (2^31 - 1), and if it is less than -2^31 it saturates to -2^31, where 32 is the length of the element. The signed-integer result is placed into the corresponding element of vD.

Other registers altered: SAT

Figure 6-129 shows the usage of the vsubsws instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-129. vsubsws: Subtract Four Signed Integer Elements (32-Bit)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1920 (21-31)
vsububm
Vector Subtract Unsigned Byte Modulo

vsububm vD,vA,vB        Form: VX

    do i=0 to 127 by 8
        vD[i:i+7] ← (vA)[i:i+7] +int ¬(vB)[i:i+7] +int 1
    end

Each element of vsububm is a byte. Each integer element in vB is subtracted from the corresponding integer element in vA. The integer result is placed into the corresponding element of vD.

Other registers altered: None

Note: The vsububm instruction can be used for unsigned or signed integers.

Figure 6-130 shows the usage of the vsububm instruction. Each of the sixteen elements in the vectors, vA, vB, and vD, is 8 bits long.

Figure 6-130. vsububm: Subtract Sixteen Integer Elements (8-Bit)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1024 (21-31)
vsububs
Vector Subtract Unsigned Byte Saturate

vsububs vD,vA,vB        Form: VX

    do i=0 to 127 by 8
        aop[0:8] ← ZeroExtend((vA)[i:i+7],9)
        bop[0:8] ← ZeroExtend((vB)[i:i+7],9)
        temp[0:8] ← aop[0:8] +int ¬bop[0:8] +int 1
        vD[i:i+7] ← SItoUIsat(temp[0:8],8)
    end

Each element is a byte. Each unsigned-integer element in vB is subtracted from the corresponding unsigned-integer element in vA. If the intermediate result is less than 0 it saturates to 0, where 8 is the length of the element. The unsigned-integer result is placed into the corresponding element of vD.

Other registers altered: SAT

Figure 6-131 shows the usage of the vsububs instruction. Each of the sixteen elements in the vectors, vA, vB, and vD, is 8 bits long.

Figure 6-131. vsububs: Subtract Sixteen Unsigned Integer Elements (8-Bit)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1536 (21-31)
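The difference between the modulo form (vsububm) and the saturating form (vsububs) shows up only when a vB element exceeds the corresponding vA element. The following sketch contrasts the two. The function names mirror the mnemonics, but the list-based representation and the returned saturation flag are assumptions made for illustration.

```python
def vsububm(va, vb):
    """Byte subtract with wraparound: results are reduced modulo 256."""
    return [(a - b) & 0xFF for a, b in zip(va, vb)]


def vsububs(va, vb):
    """Byte subtract clamped at 0; also report whether saturation occurred."""
    result, sat = [], False
    for a, b in zip(va, vb):
        diff = a - b
        if diff < 0:
            diff, sat = 0, True    # negative differences saturate to 0
        result.append(diff)
    return result, sat


a, b = [10, 0, 255], [20, 1, 5]
print(vsububm(a, b))   # wraps around
print(vsububs(a, b))   # clamps at zero
```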
vsubuhm
Vector Subtract Unsigned Half Word Modulo

vsubuhm vD,vA,vB        Form: VX

    do i=0 to 127 by 16
        vD[i:i+15] ← (vA)[i:i+15] +int ¬(vB)[i:i+15] +int 1
    end

Each element is a half word. Each integer element in vB is subtracted from the corresponding integer element in vA. The integer result is placed into the corresponding element of vD.

Other registers altered: None

Note: The vsubuhm instruction can be used for unsigned or signed integers.

Figure 6-132 shows the usage of the vsubuhm instruction. Each of the eight elements in the vectors, vA, vB, and vD, is 16 bits long.

Figure 6-132. vsubuhm: Subtract Eight Integer Elements (16-Bit)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1088 (21-31)
vsubuhs
Vector Subtract Unsigned Half Word Saturate

vsubuhs vD,vA,vB        Form: VX

    do i=0 to 127 by 16
        aop[0:16] ← ZeroExtend((vA)[i:i+15],17)
        bop[0:16] ← ZeroExtend((vB)[i:i+15],17)
        temp[0:16] ← aop[0:16] +int ¬bop[0:16] +int 1
        vD[i:i+15] ← SItoUIsat(temp[0:16],16)
    end

Each element is a half word. Each unsigned-integer element in vB is subtracted from the corresponding unsigned-integer element in vA. If the intermediate result is less than 0 it saturates to 0, where 16 is the length of the element. The unsigned-integer result is placed into the corresponding element of vD.

Other registers altered: SAT

Figure 6-133 shows the usage of the vsubuhs instruction. Each of the eight elements in the vectors, vA, vB, and vD, is 16 bits long.

Figure 6-133. vsubuhs: Subtract Eight Unsigned Integer Elements (16-Bit)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1600 (21-31)
vsubuwm
Vector Subtract Unsigned Word Modulo

vsubuwm vD,vA,vB        Form: VX

    do i=0 to 127 by 32
        vD[i:i+31] ← (vA)[i:i+31] +int ¬(vB)[i:i+31] +int 1
    end

Each element of vsubuwm is a word. Each integer element in vB is subtracted from the corresponding integer element in vA. The integer result is placed into the corresponding element of vD.

Other registers altered: None

Note: The vsubuwm instruction can be used for unsigned or signed integers.

Figure 6-134 shows the usage of the vsubuwm instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-134. vsubuwm: Subtract Four Integer Elements (32-Bit)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1152 (21-31)
vsubuws
Vector Subtract Unsigned Word Saturate

vsubuws vD,vA,vB        Form: VX

    do i=0 to 127 by 32
        aop[0:32] ← ZeroExtend((vA)[i:i+31],33)
        bop[0:32] ← ZeroExtend((vB)[i:i+31],33)
        temp[0:32] ← aop[0:32] +int ¬bop[0:32] +int 1
        vD[i:i+31] ← SItoUIsat(temp[0:32],32)
    end

Each element is a word. Each unsigned-integer element in vB is subtracted from the corresponding unsigned-integer element in vA. If the intermediate result is less than 0 it saturates to 0, where 32 is the length of the element. The unsigned-integer result is placed into the corresponding element of vD.

Other registers altered: SAT

Figure 6-135 shows the usage of the vsubuws instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-135. vsubuws: Subtract Four Unsigned Integer Elements (32-Bit)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1664 (21-31)
vsumsws
Vector Sum Across Signed Word Saturate

vsumsws vD,vA,vB        Form: VX

    temp[0:34] ← SignExtend((vB)[96:127],35)
    do i=0 to 127 by 32
        temp[0:34] ← temp[0:34] +int SignExtend((vA)[i:i+31],35)
        vD ← (96)0 || SItoSIsat(temp[0:34],32)
    end

Here (96)0 denotes a string of 96 zero bits.

The signed-integer sum of the four signed-integer word elements in vA is added to the signed-integer word element in bits vB[96-127]. If the intermediate result is greater than (2^31 - 1) it saturates to (2^31 - 1), and if it is less than -2^31 it saturates to -2^31. The signed-integer result is placed into bits vD[96-127]. Bits vD[0-95] are cleared.

Other registers altered: SAT

Figure 6-136 shows the usage of the vsumsws instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-136. vsumsws: Sum Four Signed Integer Elements (32-Bit)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1928 (21-31)
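As a rough illustration of the sum-across behavior, the sketch below adds all four words of vA to the last word of vB and clamps once at the end, which mirrors the 35-bit intermediate in the pseudocode. The function name follows the mnemonic, while the list-of-signed-integers representation and the returned saturation flag are assumptions.

```python
def vsumsws(va, vb):
    """Sum the four signed words of va plus word 3 of vb, saturating to 32 bits."""
    total = sum(va) + vb[3]                       # exact intermediate sum
    clamped = max(-(2**31), min(2**31 - 1, total))
    # Words 0 to 2 of the result are cleared; word 3 holds the sum.
    return [0, 0, 0, clamped], clamped != total


print(vsumsws([1, 2, 3, 4], [100, 100, 100, 10]))
```

Only word 3 of vB participates; the other three words of vB are ignored.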
vsum2sws
Vector Sum Across Partial (1/2) Signed Word Saturate

vsum2sws vD,vA,vB        Form: VX

    do i=0 to 127 by 64
        temp[0:33] ← SignExtend((vB)[i+32:i+63],34)
        do j=0 to 63 by 32
            temp[0:33] ← temp[0:33] +int SignExtend((vA)[i+j:i+j+31],34)
        end
        vD[i:i+63] ← (32)0 || SItoSIsat(temp[0:33],32)
    end

Here (32)0 denotes a string of 32 zero bits.

The signed-integer sum of the first two signed-integer word elements in register vA is added to the signed-integer word element in vB[32-63]. If the intermediate result is greater than (2^31 - 1) it saturates to (2^31 - 1), and if it is less than -2^31 it saturates to -2^31. The signed-integer result is placed into vD[32-63].

The signed-integer sum of the last two signed-integer word elements in register vA is added to the signed-integer word element in vB[96-127]. If the intermediate result is greater than (2^31 - 1) it saturates to (2^31 - 1), and if it is less than -2^31 it saturates to -2^31. The signed-integer result is placed into vD[96-127]. Bits vD[0-31,64-95] are cleared to 0.

Other registers altered: SAT

Figure 6-137 shows the usage of the vsum2sws instruction. Each of the four elements in the vectors, vA, vB, and vD, is 32 bits long.

Figure 6-137. vsum2sws: Two Sums in the Four Signed Integer Elements (32-Bit)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1672 (21-31)
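The two partial sums can be modeled in a few lines. In this sketch, which uses an assumed list-of-signed-words representation and a hypothetical helper name for the clamp, words 0 and 1 of vA feed result word 1, and words 2 and 3 of vA feed result word 3.

```python
def clamp_s32(value):
    """Clamp an exact integer to the signed 32-bit range."""
    return max(-(2**31), min(2**31 - 1, value))


def vsum2sws(va, vb):
    """Return (result, sat) with the two sums in words 1 and 3."""
    first = va[0] + va[1] + vb[1]     # uses vB[32-63]
    second = va[2] + va[3] + vb[3]    # uses vB[96-127]
    result = [0, clamp_s32(first), 0, clamp_s32(second)]
    sat = result[1] != first or result[3] != second
    return result, sat


print(vsum2sws([1, 2, 3, 4], [9, 10, 9, 20]))
```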
vsum4sbs
Vector Sum Across Partial (1/4) Signed Byte Saturate

vsum4sbs vD,vA,vB        Form: VX

    do i=0 to 127 by 32
        temp[0:32] ← SignExtend((vB)[i:i+31],33)
        do j=0 to 31 by 8
            temp[0:32] ← temp[0:32] +int SignExtend((vA)[i+j:i+j+7],33)
        end
        vD[i:i+31] ← SItoSIsat(temp[0:32],32)
    end

For each word element in vB the following operations are performed in the order shown. The signed-integer sum of the four signed-integer byte elements contained in the corresponding word element of register vA is added to the signed-integer word element in register vB. If the intermediate result is greater than (2^31 - 1) it saturates to (2^31 - 1), and if it is less than -2^31 it saturates to -2^31. The signed-integer result is placed into the corresponding word element of vD.

Other registers altered: SAT

Figure 6-138 shows the usage of the vsum4sbs instruction. Each of the sixteen elements in the vector vA is 8 bits long. Each of the four elements in the vectors vB and vD is 32 bits long.

Figure 6-138. vsum4sbs: Four Sums in the Integer Elements (32-Bit)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1800 (21-31)
vsum4shs
Vector Sum Across Partial (1/4) Signed Half Word Saturate

vsum4shs vD,vA,vB        Form: VX

    do i=0 to 127 by 32
        temp[0:32] ← SignExtend((vB)[i:i+31],33)
        do j=0 to 31 by 16
            temp[0:32] ← temp[0:32] +int SignExtend((vA)[i+j:i+j+15],33)
        end
        vD[i:i+31] ← SItoSIsat(temp[0:32],32)
    end

For each word element in register vB the following operations are performed, in the order shown. The signed-integer sum of the two signed-integer halfword elements contained in the corresponding word element of register vA is added to the signed-integer word element in vB. If the intermediate result is greater than (2^31 - 1) it saturates to (2^31 - 1), and if it is less than -2^31 it saturates to -2^31. The signed-integer result is placed into the corresponding word element of vD.

Other registers altered: SAT

Figure 6-139 shows the usage of the vsum4shs instruction. Each of the eight elements in the vector vA is 16 bits long. Each of the four elements in the vectors vB and vD is 32 bits long.

Figure 6-139. vsum4shs: Four Sums in the Integer Elements (32-Bit)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1608 (21-31)
vsum4ubs
Vector Sum Across Partial (1/4) Unsigned Byte Saturate

vsum4ubs vD,vA,vB        Form: VX

    do i=0 to 127 by 32
        temp[0:32] ← ZeroExtend((vB)[i:i+31],33)
        do j=0 to 31 by 8
            temp[0:32] ← temp[0:32] +int ZeroExtend((vA)[i+j:i+j+7],33)
        end
        vD[i:i+31] ← UItoUIsat(temp[0:32],32)
    end

For each word element in vB the following operations are performed in the order shown. The unsigned-integer sum of the four unsigned-integer byte elements contained in the corresponding word element of register vA is added to the unsigned-integer word element in register vB. If the intermediate result is greater than (2^32 - 1) it saturates to (2^32 - 1). The unsigned-integer result is placed into the corresponding word element of vD.

Other registers altered: SAT

Figure 6-140 shows the usage of the vsum4ubs instruction. Each of the sixteen elements in the vector vA is 8 bits long. Each of the four elements in the vectors vB and vD is 32 bits long.

Figure 6-140. vsum4ubs: Four Sums in the Integer Elements (32-Bit)

Encoding: 04 (bits 0-5) | vD (6-10) | vA (11-15) | vB (16-20) | 1544 (21-31)
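The partial (1/4) sums group the sixteen bytes of vA four at a time, one group per word of vB. A minimal sketch of the unsigned form follows, assuming vA is a list of sixteen byte values and vB a list of four unsigned words; the returned flag stands in for the SAT bit.

```python
def vsum4ubs(va, vb):
    """Add each group of four bytes of va to the matching word of vb."""
    limit = 2**32 - 1
    result, sat = [], False
    for word in range(4):
        total = sum(va[4 * word: 4 * word + 4]) + vb[word]
        if total > limit:
            total, sat = limit, True   # unsigned saturation at 2**32 - 1
        result.append(total)
    return result, sat


print(vsum4ubs(list(range(16)), [0, 0, 0, 0xFFFFFFFF]))
```

The signed forms (vsum4sbs, vsum4shs) follow the same grouping but clamp to the signed 32-bit range instead.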
vupkhpx
Vector Unpack High Pixel16

vupkhpx vD,vB        Form: VX

    do i=0 to 63 by 16
        vD[i*2:(i*2)+7] ← SignExtend((vB)[i],8)
        vD[(i*2)+8:(i*2)+15] ← ZeroExtend((vB)[i+1:i+5],8)
        vD[(i*2)+16:(i*2)+23] ← ZeroExtend((vB)[i+6:i+10],8)
        vD[(i*2)+24:(i*2)+31] ← ZeroExtend((vB)[i+11:i+15],8)
    end

Each halfword element in the high-order half of register vB is unpacked to produce a 32-bit value as described below and placed, in the same order, into the four words of vD.

A halfword is unpacked to 32 bits by concatenating, in order, the results of the following operations.

    sign-extend bit 0 of the halfword to 8 bits
    zero-extend bits 1-5 of the halfword to 8 bits
    zero-extend bits 6-10 of the halfword to 8 bits
    zero-extend bits 11-15 of the halfword to 8 bits

Other registers altered: None

The source and target elements can be considered to be 16-bit and 32-bit "pixels" respectively, having the formats described in the programming note for the Vector Pack Pixel instruction.

Figure 6-141 shows the usage of the vupkhpx instruction. Each of the eight elements in the vector vB is 16 bits long. Each of the four elements in the vector vD is 32 bits long.

Figure 6-141. vupkhpx: Unpack High-Order Elements (16-Bit) to Elements (32-Bit)

Encoding: 04 (bits 0-5) | vD (6-10) | 0_0000 (11-15) | vB (16-20) | 846 (21-31)
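The four unpacking steps map a 1/5/5/5-bit halfword onto four 8-bit channels. The sketch below follows those steps for a single halfword and then applies them to the high-order half of a vector. The helper names and the list-of-halfwords representation are assumptions for illustration. Bit 0 of the halfword is its most significant bit.

```python
def unpack_pixel(halfword):
    """Unpack one 16-bit pixel (1/5/5/5 bits) into a 32-bit pixel (8/8/8/8)."""
    first = 0xFF if (halfword >> 15) & 1 else 0x00   # sign-extend bit 0
    c1 = (halfword >> 10) & 0x1F                     # bits 1-5, zero-extended
    c2 = (halfword >> 5) & 0x1F                      # bits 6-10, zero-extended
    c3 = halfword & 0x1F                             # bits 11-15, zero-extended
    return (first << 24) | (c1 << 16) | (c2 << 8) | c3


def vupkhpx(vb):
    """Unpack the four high-order halfwords of an eight-halfword vector."""
    return [unpack_pixel(h) for h in vb[:4]]


print([hex(w) for w in vupkhpx([0xFFFF, 0x8000, 0x7C00, 0x001F, 0, 0, 0, 0])])
```

A vupklpx model would differ only in taking the low-order halfwords, `vb[4:]`.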
vupkhsb
Vector Unpack High Signed Byte

vupkhsb vD,vB        Form: VX

    do i=0 to 63 by 8
        vD[i*2:(i*2)+15] ← SignExtend((vB)[i:i+7],16)
    end

Each signed integer byte element in the high-order half of register vB is sign-extended to produce a 16-bit signed integer and placed, in the same order, into the eight halfwords of register vD.

Other registers altered: None

Figure 6-142 shows the usage of the vupkhsb instruction. Each of the sixteen elements in the vector vB is 8 bits long. Each of the eight elements in the vector vD is 16 bits long.

Figure 6-142. vupkhsb: Unpack High-Order Signed Integer Elements (8-Bit) to Signed Integer Elements (16-Bit)

Encoding: 04 (bits 0-5) | vD (6-10) | 0_0000 (11-15) | vB (16-20) | 526 (21-31)
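Sign-extending the high-order half of a vector can be sketched with one helper that takes the source element width as a parameter, so that it covers both the byte form (vupkhsb) and the halfword form (vupkhsh). The function name and the list-of-unsigned-patterns representation are assumptions made for illustration.

```python
def unpack_high_signed(vb, bits):
    """Sign-extend the first half of vb's elements to twice their width."""
    wide_mask = (1 << (2 * bits)) - 1
    sign_bit = 1 << (bits - 1)
    result = []
    for element in vb[: len(vb) // 2]:
        # Reinterpret the unsigned pattern as a signed value, then widen.
        signed = element - (1 << bits) if element & sign_bit else element
        result.append(signed & wide_mask)
    return result


bytes_in = [0x80, 0x7F, 0xFF, 0x01, 0, 0, 0, 0] + [0x55] * 8
print([hex(h) for h in unpack_high_signed(bytes_in, 8)])
```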
vupkhsh
Vector Unpack High Signed Halfword

vupkhsh vD,vB          Form: VX

Encoding:  0-5: 04 | 6-10: vD | 11-15: 0_0000 | 16-20: vB | 21-31: 590

do i=0 to 63 by 16
  vD[i*2:(i*2)+31] ← SignExtend((vB)[i:i+15],32)
end

Each signed integer halfword element in the high-order half of register vB is sign-extended to produce a 32-bit signed integer and placed, in the same order, into the four words of register vD.

Other registers altered: None

Figure 6-143 shows the usage of the vupkhsh instruction. Each of the eight elements in vB is 16 bits long. Each of the four elements in vD is 32 bits long.

Figure 6-143. vupkhsh: Unpack Signed Integer Elements (16-Bit) to Signed Integer Elements (32-Bit)
vupklpx
Vector Unpack Low Pixel16

vupklpx vD,vB          Form: VX

Encoding:  0-5: 04 | 6-10: vD | 11-15: 0_0000 | 16-20: vB | 21-31: 974

do i=0 to 63 by 16
  vD[i*2:(i*2)+7] ← SignExtend((vB)[i+64],8)
  vD[(i*2)+8:(i*2)+15] ← ZeroExtend((vB)[i+65:i+69],8)
  vD[(i*2)+16:(i*2)+23] ← ZeroExtend((vB)[i+70:i+74],8)
  vD[(i*2)+24:(i*2)+31] ← ZeroExtend((vB)[i+75:i+79],8)
end

Each halfword element in the low-order half of register vB is unpacked to produce a 32-bit value as described below and placed, in the same order, into the four words of register vD.

A halfword is unpacked to 32 bits by concatenating, in order, the results of the following operations:
  • Sign-extend bit 0 of the halfword to 8 bits
  • Zero-extend bits 1-5 of the halfword to 8 bits
  • Zero-extend bits 6-10 of the halfword to 8 bits
  • Zero-extend bits 11-15 of the halfword to 8 bits

Other registers altered: None

Programming note: The unpacking done by the Vector Unpack Pixel instructions does not reverse the packing done by the Vector Pack Pixel instruction. Specifically, if a 16-bit pixel is unpacked to a 32-bit pixel which is then packed to a 16-bit pixel, the resulting 16-bit pixel will not, in general, be equal to the original 16-bit pixel (because, for each channel except the first, Vector Unpack Pixel inserts high-order bits while Vector Pack Pixel discards low-order bits).

Figure 6-144 shows the usage of the vupklpx instruction. Each of the eight elements in vB is 16 bits long. Each of the four elements in vD is 32 bits long.

Figure 6-144. vupklpx: Unpack Low-Order Elements (16-Bit) to Elements (32-Bit)
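The programming note can be checked with a small Python sketch. The unpack step follows the pseudocode for the Vector Unpack Pixel instructions. The pack step is an assumption based on the note's wording and on the Vector Pack Pixel description elsewhere in this manual: it is modeled as keeping the low-order bit of the first byte and the five high-order bits of each remaining byte. Both function names are hypothetical.

```python
def unpack_pixel16(halfword):
    """16-bit 1/5/5/5 pixel to 32-bit 8/8/8/8 pixel, as in vupkhpx/vupklpx."""
    first = 0xFF if (halfword >> 15) & 0x1 else 0x00
    ch1 = (halfword >> 10) & 0x1F
    ch2 = (halfword >> 5) & 0x1F
    ch3 = halfword & 0x1F
    return (first << 24) | (ch1 << 16) | (ch2 << 8) | ch3


def pack_pixel32(word):
    """32-bit pixel to 16-bit pixel (assumed model of Vector Pack Pixel)."""
    first = (word >> 24) & 0x1    # low-order bit of the first byte
    ch1 = (word >> 19) & 0x1F     # five high-order bits of the second byte
    ch2 = (word >> 11) & 0x1F     # five high-order bits of the third byte
    ch3 = (word >> 3) & 0x1F      # five high-order bits of the fourth byte
    return (first << 15) | (ch1 << 10) | (ch2 << 5) | ch3


original = 0xFFFF
round_trip = pack_pixel32(unpack_pixel16(original))
# Unpack left each 5-bit channel in the low bits of its byte; pack then kept only
# the top five bits of that byte, so each channel shrinks from 0x1F to 0x03.
print(hex(original), "->", hex(round_trip))
```

Under this model only pixels whose three 5-bit channels are all zero survive the round trip unchanged, which matches the note's "not, in general, equal".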
vupklsb
Vector Unpack Low Signed Byte

vupklsb vD,vB          Form: VX

Encoding:  0-5: 04 | 6-10: vD | 11-15: 0_0000 | 16-20: vB | 21-31: 654

do i=0 to 63 by 8
  vD[i*2:(i*2)+15] ← SignExtend((vB)[i+64:i+71],16)
end

Each signed integer byte element in the low-order half of register vB is sign-extended to produce a 16-bit signed integer and placed, in the same order, into the eight halfwords of register vD.

Other registers altered: None

Figure 6-145 shows the usage of the vupklsb instruction. Each of the sixteen elements in vB is 8 bits long. Each of the eight elements in vD is 16 bits long.

Figure 6-145. vupklsb: Unpack Low-Order Elements (8-Bit) to Elements (16-Bit)
vupklsh
Vector Unpack Low Signed Halfword

vupklsh vD,vB          Form: VX

Encoding:  0-5: 04 | 6-10: vD | 11-15: 0_0000 | 16-20: vB | 21-31: 718

do i=0 to 63 by 16
  vD[i*2:(i*2)+31] ← SignExtend((vB)[i+64:i+79],32)
end

Each signed integer halfword element in the low-order half of register vB is sign-extended to produce a 32-bit signed integer and placed, in the same order, into the four words of register vD.

Other registers altered: None

Figure 6-146 shows the usage of the vupklsh instruction. Each of the eight elements in vB is 16 bits long. Each of the four elements in vD is 32 bits long.

Figure 6-146. vupklsh: Unpack Low-Order Signed Integer Elements (16-Bit) to Signed Integer Elements (32-Bit)
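The four signed unpack instructions (vupkhsb, vupkhsh, vupklsb, vupklsh) differ only in element width and in which half of vB they read. A minimal Python sketch of that shared behavior follows. The function names and the list-based register model are assumptions for illustration, with element 0 taken as the high-order end of the register.

```python
def sign_extend(value, from_bits):
    """Interpret the low from_bits bits of value as a two's-complement integer."""
    sign = 1 << (from_bits - 1)
    value &= (1 << from_bits) - 1
    return (value ^ sign) - sign


def unpack_signed(vb, high):
    """Model of vupkhsb/vupklsb (16 byte elements) and vupkhsh/vupklsh
    (8 halfword elements). Returns the bit patterns of the widened elements."""
    count = len(vb)
    if count not in (16, 8):
        raise ValueError("vB holds sixteen bytes or eight halfwords")
    width = 128 // count                       # 8 or 16 bits per source element
    half = vb[:count // 2] if high else vb[count // 2:]
    out_mask = (1 << (2 * width)) - 1          # each result is twice as wide
    return [sign_extend(e, width) & out_mask for e in half]
```

With sixteen byte elements and high=True this models vupkhsb; with eight halfword elements and high=False it models vupklsh. A source byte of 0x80 widens to 0xFF80, and a source halfword of 0x8000 widens to 0xFFFF8000.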
vxor
Vector Logical XOR

vxor vD,vA,vB          Form: VX

Encoding:  0-5: 04 | 6-10: vD | 11-15: vA | 16-20: vB | 21-31: 1220

vD ← (vA) ⊕ (vB)

The contents of register vA are XORed with the contents of register vB and the result is placed into register vD.

Other registers altered: None

Figure 6-147 shows the usage of the vxor instruction.

Figure 6-147. vxor: Bitwise XOR (128-Bit)
Appendix A
AltiVec Instruction Set Listings

This appendix lists the instruction set for AltiVec technology. Instructions are sorted by mnemonic, opcode, and form. Also included in this appendix is a quick reference table that contains general information, such as the architecture level, privilege level, and form, and indicates if the instruction is optional.

Note that split fields, which represent the concatenation of sequences from left to right, are shown in lowercase.

A.1 Instructions Sorted by Mnemonic in Decimal Format

Table A-1 lists the instructions implemented in the AltiVec architecture in alphabetical order by mnemonic. The primary and extended opcodes are decimal numbers.

Table A-1. Instructions Sorted by Mnemonic in Decimal Format

Column headings are instruction bit ranges. A trailing 0 after the extended opcode in the last column is bit 31.

Name        0-5  6-10        11-15   16-20   21-31
dss         31   0 0_0 STRM  0_0000  0000_0  822 0
dssall      31   1 0_0 STRM  0_0000  0000_0  822 0
dst         31   0 0_0 STRM  A       B       342 0
dstst       31   0 0_0 STRM  A       B       374 0
dststt      31   1 0_0 STRM  A       B       374 0
dstt        31   1 0_0 STRM  A       B       342 0
lvebx       31   vD          A       B       7 0
lvehx       31   vD          A       B       39 0
lvewx       31   vD          A       B       71 0
lvsl        31   vD          A       B       6 0
lvsr        31   vD          A       B       38 0
lvx         31   vD          A       B       103 0
Table A-1. Instructions Sorted by Mnemonic in Decimal Format (continued)

Name        0-5  6-10        11-15   16-20   21-31
lvxl        31   vD          A       B       359 0
mfvscr      04   vD          0_0000  0000_0  1540
mtvscr      04   00_000      0_0000  vB      1604
stvebx      31   vS          A       B       135 0
stvehx      31   vS          A       B       167 0
stvewx      31   vS          A       B       199 0
stvx        31   vS          A       B       231 0
stvxl       31   vS          A       B       487 0
vaddcuw     04   vD          vA      vB      384
vaddfp      04   vD          vA      vB      10
vaddsbs     04   vD          vA      vB      768
vaddshs     04   vD          vA      vB      832
vaddsws     04   vD          vA      vB      896
vaddubm     04   vD          vA      vB      0
vaddubs     04   vD          vA      vB      512
vadduhm     04   vD          vA      vB      64
vadduhs     04   vD          vA      vB      576
vadduwm     04   vD          vA      vB      128
vadduws     04   vD          vA      vB      640
vand        04   vD          vA      vB      1028
vandc       04   vD          vA      vB      1092
vavgsb      04   vD          vA      vB      1282
vavgsh      04   vD          vA      vB      1346
vavgsw      04   vD          vA      vB      1410
vavgub      04   vD          vA      vB      1026
vavguh      04   vD          vA      vB      1090
vavguw      04   vD          vA      vB      1154
vcfsx       04   vD          UIMM    vB      842
vcfux       04   vD          UIMM    vB      778
vcmpbfpx    04   vD          vA      vB      Rc 966
vcmpeqfpx   04   vD          vA      vB      Rc 198
vcmpequbx   04   vD          vA      vB      Rc 6
vcmpequhx   04   vD          vA      vB      Rc 70
Table A-1. Instructions Sorted by Mnemonic in Decimal Format (continued)

Name        0-5  6-10        11-15   16-20   21-31
vcmpequwx   04   vD          vA      vB      Rc 134
vcmpgefpx   04   vD          vA      vB      Rc 454
vcmpgtfpx   04   vD          vA      vB      Rc 710
vcmpgtsbx   04   vD          vA      vB      Rc 774
vcmpgtshx   04   vD          vA      vB      Rc 838
vcmpgtswx   04   vD          vA      vB      Rc 902
vcmpgtubx   04   vD          vA      vB      Rc 518
vcmpgtuhx   04   vD          vA      vB      Rc 582
vcmpgtuwx   04   vD          vA      vB      Rc 646
vctsxs      04   vD          UIMM    vB      970
vctuxs      04   vD          UIMM    vB      906
vexptefp    04   vD          0_0000  vB      394
vlogefp     04   vD          0_0000  vB      458
vmaddfp     04   vD          vA      vB      vC 46
vmaxfp      04   vD          vA      vB      1034
vmaxsb      04   vD          vA      vB      258
vmaxsh      04   vD          vA      vB      322
vmaxsw      04   vD          vA      vB      386
vmaxub      04   vD          vA      vB      2
vmaxuh      04   vD          vA      vB      66
vmaxuw      04   vD          vA      vB      130
vmhaddshs   04   vD          vA      vB      vC 32
vmhraddshs  04   vD          vA      vB      vC 33
vminfp      04   vD          vA      vB      1098
vminsb      04   vD          vA      vB      770
vminsh      04   vD          vA      vB      834
vminsw      04   vD          vA      vB      898
vminub      04   vD          vA      vB      514
vminuh      04   vD          vA      vB      578
vminuw      04   vD          vA      vB      642
vmladduhm   04   vD          vA      vB      vC 34
vmrghb      04   vD          vA      vB      12
vmrghh      04   vD          vA      vB      76
Table A-1. Instructions Sorted by Mnemonic in Decimal Format (continued)

Name        0-5  6-10        11-15   16-20   21-31
vmrghw      04   vD          vA      vB      140
vmrglb      04   vD          vA      vB      268
vmrglh      04   vD          vA      vB      332
vmrglw      04   vD          vA      vB      396
vmsummbm    04   vD          vA      vB      vC 37
vmsumshm    04   vD          vA      vB      vC 40
vmsumshs    04   vD          vA      vB      vC 41
vmsumubm    04   vD          vA      vB      vC 36
vmsumuhm    04   vD          vA      vB      vC 38
vmsumuhs    04   vD          vA      vB      vC 39
vmulesb     04   vD          vA      vB      776
vmulesh     04   vD          vA      vB      840
vmuleub     04   vD          vA      vB      520
vmuleuh     04   vD          vA      vB      584
vmulosb     04   vD          vA      vB      264
vmulosh     04   vD          vA      vB      328
vmuloub     04   vD          vA      vB      8
vmulouh     04   vD          vA      vB      72
vnmsubfp    04   vD          vA      vB      vC 47
vnor        04   vD          vA      vB      1284
vor         04   vD          vA      vB      1156
vperm       04   vD          vA      vB      vC 43
vpkpx       04   vD          vA      vB      782
vpkshss     04   vD          vA      vB      398
vpkshus     04   vD          vA      vB      270
vpkswss     04   vD          vA      vB      462
vpkswus     04   vD          vA      vB      334
vpkuhum     04   vD          vA      vB      14
vpkuhus     04   vD          vA      vB      142
vpkuwum     04   vD          vA      vB      78
vpkuwus     04   vD          vA      vB      206
vrefp       04   vD          0_0000  vB      266
vrfim       04   vD          0_0000  vB      714
Table A-1. Instructions Sorted by Mnemonic in Decimal Format (continued)

Name        0-5  6-10        11-15   16-20   21-31
vrfin       04   vD          0_0000  vB      522
vrfip       04   vD          0_0000  vB      650
vrfiz       04   vD          0_0000  vB      586
vrlb        04   vD          vA      vB      4
vrlh        04   vD          vA      vB      68
vrlw        04   vD          vA      vB      132
vrsqrtefp   04   vD          0_0000  vB      330
vsel        04   vD          vA      vB      vC 42
vsl         04   vD          vA      vB      452
vslb        04   vD          vA      vB      260
vsldoi      04   vD          vA      vB      0 SH 44
vslh        04   vD          vA      vB      324
vslo        04   vD          vA      vB      1036
vslw        04   vD          vA      vB      388
vspltb      04   vD          UIMM    vB      524
vsplth      04   vD          UIMM    vB      588
vspltisb    04   vD          SIMM    0000_0  780
vspltish    04   vD          SIMM    0000_0  844
vspltisw    04   vD          SIMM    0000_0  908
vspltw      04   vD          UIMM    vB      652
vsr         04   vD          vA      vB      708
vsrab       04   vD          vA      vB      772
vsrah       04   vD          vA      vB      836
vsraw       04   vD          vA      vB      900
vsrb        04   vD          vA      vB      516
vsrh        04   vD          vA      vB      580
vsro        04   vD          vA      vB      1100
vsrw        04   vD          vA      vB      644
vsubcuw     04   vD          vA      vB      1408
vsubfp      04   vD          vA      vB      74
vsubsbs     04   vD          vA      vB      1792
vsubshs     04   vD          vA      vB      1856
vsubsws     04   vD          vA      vB      1920
Table A-1. Instructions Sorted by Mnemonic in Decimal Format (continued)

Name        0-5  6-10        11-15   16-20   21-31
vsububm     04   vD          vA      vB      1024
vsububs     04   vD          vA      vB      1536
vsubuhm     04   vD          vA      vB      1088
vsubuhs     04   vD          vA      vB      1600
vsubuwm     04   vD          vA      vB      1152
vsubuws     04   vD          vA      vB      1664
vsumsws     04   vD          vA      vB      1928
vsum2sws    04   vD          vA      vB      1672
vsum4sbs    04   vD          vA      vB      1800
vsum4shs    04   vD          vA      vB      1608
vsum4ubs    04   vD          vA      vB      1544
vupkhpx     04   vD          0_0000  vB      846
vupkhsb     04   vD          0_0000  vB      526
vupkhsh     04   vD          0_0000  vB      590
vupklpx     04   vD          0_0000  vB      974
vupklsb     04   vD          0_0000  vB      654
vupklsh     04   vD          0_0000  vB      718
vxor        04   vD          vA      vB      1220
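As an illustration of how the decimal fields in Table A-1 combine into a 32-bit instruction word, the following Python sketch assembles three of the layouts that appear in the table. The function names are hypothetical. The field boundaries are an assumption read off the bit-number header: register fields at bits 6-10, 11-15, and 16-20, an 11-bit extended opcode at bits 21-31, a vC field at bits 21-25 followed by a 6-bit extended opcode, and, for primary opcode 31, a 10-bit extended opcode at bits 21-30 with bit 31 zero. Bit 0 is the most significant bit.

```python
def _check(value, bits, what):
    if not 0 <= value < (1 << bits):
        raise ValueError("%s does not fit in %d bits" % (what, bits))


def encode_vx(xo, vd, va, vb, primary=4):
    """vD, vA, vB plus an 11-bit extended opcode (for example vxor, 1220)."""
    for field in (vd, va, vb):
        _check(field, 5, "register field")
    _check(xo, 11, "extended opcode")
    return (primary << 26) | (vd << 21) | (va << 16) | (vb << 11) | xo


def encode_va(xo, vd, va, vb, vc, primary=4):
    """vD, vA, vB, vC plus a 6-bit extended opcode (for example vperm, 43)."""
    for field in (vd, va, vb, vc):
        _check(field, 5, "register field")
    _check(xo, 6, "extended opcode")
    return (primary << 26) | (vd << 21) | (va << 16) | (vb << 11) | (vc << 6) | xo


def encode_x(xo, vd, ra, rb, primary=31):
    """Load/store layout: 10-bit extended opcode in bits 21-30, bit 31 = 0."""
    for field in (vd, ra, rb):
        _check(field, 5, "register field")
    _check(xo, 10, "extended opcode")
    return (primary << 26) | (vd << 21) | (ra << 16) | (rb << 11) | (xo << 1)


print(hex(encode_vx(1220, 1, 2, 3)))     # vxor with vD=1, vA=2, vB=3
print(hex(encode_va(43, 1, 2, 3, 4)))    # vperm with vD=1, vA=2, vB=3, vC=4
print(hex(encode_x(103, 0, 0, 3)))       # lvx with vD=0, A=0, B=3
```

Instructions with a reserved field, such as the unpack instructions, would pass 0 for the field shown as 0_0000 in the table.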
Appendix B
Instructions Sorted by Mnemonic in Binary Format

B.1 Instructions Sorted by Mnemonic in Binary Format

Table B-1 lists the instructions implemented in the AltiVec architecture in alphabetical order by mnemonic. The primary and extended opcodes are binary numbers.

Table B-1. Instructions Sorted by Mnemonic in Binary Format

Name        0-5      6-10        11-15   16-20   21-31
dss         0111_11  0 0_0 STRM  0_0000  0000_0  110_0110_110 0
dssall      0111_11  1 0_0 STRM  0_0000  0000_0  110_0110_110 0
dst         0111_11  0 0_0 STRM  A       B       010_1010_110 0
dstst       0111_11  0 0_0 STRM  A       B       010_1110_110 0
dststt      0111_11  1 0_0 STRM  A       B       010_1110_110 0
dstt        0111_11  1 0_0 STRM  A       B       010_1010_110 0
lvebx       0111_11  vD          A       B       000_0000_111 0
lvehx       0111_11  vD          A       B       000_0100_111 0
lvewx       0111_11  vD          A       B       000_1000_111 0
lvsl        0111_11  vD          A       B       000_0000_110 0
lvsr        0111_11  vD          A       B       000_0100_110 0
lvx         0111_11  vD          A       B       000_1100_111 0
lvxl        0111_11  vD          A       B       010_1100_111 0
mfvscr      0001_00  vD          0_0000  0000_0  110_0000_0100
mtvscr      0001_00  00_000      0_0000  vB      110_0100_0100
stvebx      0111_11  vS          A       B       001_0000_111 0
stvehx      0111_11  vS          A       B       001_0100_111 0
Table B-1. Instructions Sorted by Mnemonic in Binary Format (continued)

Name        0-5      6-10        11-15   16-20   21-31
stvewx      0111_11  vS          A       B       001_1000_111 0
stvx        0111_11  vS          A       B       001_1100_111 0
stvxl       0111_11  vS          A       B       011_1100_111 0
vaddcuw     0001_00  vD          vA      vB      001_1000_0000
vaddfp      0001_00  vD          vA      vB      000_0000_1010
vaddsbs     0001_00  vD          vA      vB      011_0000_0000
vaddshs     0001_00  vD          vA      vB      011_0100_0000
vaddsws     0001_00  vD          vA      vB      011_1000_0000
vaddubm     0001_00  vD          vA      vB      000_0000_0000
vaddubs     0001_00  vD          vA      vB      010_0000_0000
vadduhm     0001_00  vD          vA      vB      000_0100_0000
vadduhs     0001_00  vD          vA      vB      010_0100_0000
vadduwm     0001_00  vD          vA      vB      000_1000_0000
vadduws     0001_00  vD          vA      vB      010_1000_0000
vand        0001_00  vD          vA      vB      100_0000_0100
vandc       0001_00  vD          vA      vB      100_0100_0100
vavgsb      0001_00  vD          vA      vB      101_0000_0010
vavgsh      0001_00  vD          vA      vB      101_0100_0010
vavgsw      0001_00  vD          vA      vB      101_1000_0010
vavgub      0001_00  vD          vA      vB      100_0000_0010
vavguh      0001_00  vD          vA      vB      100_0100_0010
vavguw      0001_00  vD          vA      vB      100_1000_0010
vcfsx       0001_00  vD          UIMM    vB      011_0100_1010
vcfux       0001_00  vD          UIMM    vB      011_0000_1010
vcmpbfpx    0001_00  vD          vA      vB      Rc 11_1100_0110
vcmpeqfpx   0001_00  vD          vA      vB      Rc 00_1100_0110
vcmpequbx   0001_00  vD          vA      vB      Rc 00_0000_0110
vcmpequhx   0001_00  vD          vA      vB      Rc 00_0100_0110
vcmpequwx   0001_00  vD          vA      vB      Rc 00_1000_0110
vcmpgefpx   0001_00  vD          vA      vB      Rc 01_1100_0110
vcmpgtfpx   0001_00  vD          vA      vB      Rc 10_1100_0110
vcmpgtsbx   0001_00  vD          vA      vB      Rc 11_0000_0110
vcmpgtshx   0001_00  vD          vA      vB      Rc 11_0100_0110
Table B-1. Instructions Sorted by Mnemonic in Binary Format (continued)

Name        0-5      6-10        11-15   16-20   21-31
vcmpgtswx   0001_00  vD          vA      vB      Rc 11_1000_0110
vcmpgtubx   0001_00  vD          vA      vB      Rc 10_0000_0110
vcmpgtuhx   0001_00  vD          vA      vB      Rc 10_0100_0110
vcmpgtuwx   0001_00  vD          vA      vB      Rc 10_1000_0110
vctsxs      0001_00  vD          UIMM    vB      011_1100_1010
vctuxs      0001_00  vD          UIMM    vB      011_1000_1010
vexptefp    0001_00  vD          0_0000  vB      001_1000_1010
vlogefp     0001_00  vD          0_0000  vB      001_1100_1010
vmaddfp     0001_00  vD          vA      vB      vC 10_1110
vmaxfp      0001_00  vD          vA      vB      100_0000_1010
vmaxsb      0001_00  vD          vA      vB      001_0000_0010
vmaxsh      0001_00  vD          vA      vB      001_0100_0010
vmaxsw      0001_00  vD          vA      vB      001_1000_0010
vmaxub      0001_00  vD          vA      vB      000_0000_0010
vmaxuh      0001_00  vD          vA      vB      000_0100_0010
vmaxuw      0001_00  vD          vA      vB      000_1000_0010
vmhaddshs   0001_00  vD          vA      vB      vC 10_0000
vmhraddshs  0001_00  vD          vA      vB      vC 10_0001
vminfp      0001_00  vD          vA      vB      100_0100_1010
vminsb      0001_00  vD          vA      vB      011_0000_0010
vminsh      0001_00  vD          vA      vB      011_0100_0010
vminsw      0001_00  vD          vA      vB      011_1000_0010
vminub      0001_00  vD          vA      vB      010_0000_0010
vminuh      0001_00  vD          vA      vB      010_0100_0010
vminuw      0001_00  vD          vA      vB      010_1000_0010
vmladduhm   0001_00  vD          vA      vB      vC 10_0010
vmrghb      0001_00  vD          vA      vB      000_0000_1100
vmrghh      0001_00  vD          vA      vB      000_0100_1100
vmrghw      0001_00  vD          vA      vB      000_1000_1100
vmrglb      0001_00  vD          vA      vB      001_0000_1100
vmrglh      0001_00  vD          vA      vB      001_0100_1100
vmrglw      0001_00  vD          vA      vB      001_1000_1100
vmsummbm    0001_00  vD          vA      vB      vC 10_0101
Table B-1. Instructions Sorted by Mnemonic in Binary Format (continued)

Name        0-5      6-10        11-15   16-20   21-31
vmsumshm    0001_00  vD          vA      vB      vC 10_1000
vmsumshs    0001_00  vD          vA      vB      vC 10_1001
vmsumubm    0001_00  vD          vA      vB      vC 10_0100
vmsumuhm    0001_00  vD          vA      vB      vC 10_0110
vmsumuhs    0001_00  vD          vA      vB      vC 10_0111
vmulesb     0001_00  vD          vA      vB      011_0000_1000
vmulesh     0001_00  vD          vA      vB      011_0100_1000
vmuleub     0001_00  vD          vA      vB      010_0000_1000
vmuleuh     0001_00  vD          vA      vB      010_0100_1000
vmulosb     0001_00  vD          vA      vB      001_0000_1000
vmulosh     0001_00  vD          vA      vB      001_0100_1000
vmuloub     0001_00  vD          vA      vB      000_0000_1000
vmulouh     0001_00  vD          vA      vB      000_0100_1000
vnmsubfp    0001_00  vD          vA      vB      vC 10_1111
vnor        0001_00  vD          vA      vB      101_0000_0100
vor         0001_00  vD          vA      vB      100_1000_0100
vperm       0001_00  vD          vA      vB      vC 10_1011
vpkpx       0001_00  vD          vA      vB      011_0000_1110
vpkshss     0001_00  vD          vA      vB      001_1000_1110
vpkshus     0001_00  vD          vA      vB      001_0000_1110
vpkswss     0001_00  vD          vA      vB      001_1100_1110
vpkswus     0001_00  vD          vA      vB      001_0100_1110
vpkuhum     0001_00  vD          vA      vB      000_0000_1110
vpkuhus     0001_00  vD          vA      vB      000_1000_1110
vpkuwum     0001_00  vD          vA      vB      000_0100_1110
vpkuwus     0001_00  vD          vA      vB      000_1100_1110
vrefp       0001_00  vD          0_0000  vB      001_0000_1010
vrfim       0001_00  vD          0_0000  vB      010_1100_1010
vrfin       0001_00  vD          0_0000  vB      010_0000_1010
vrfip       0001_00  vD          0_0000  vB      010_1000_1010
vrfiz       0001_00  vD          0_0000  vB      010_0100_1010
vrlb        0001_00  vD          vA      vB      000_0000_0100
vrlh        0001_00  vD          vA      vB      000_0100_0100
Table B-1. Instructions Sorted by Mnemonic in Binary Format (continued)

Name        0-5      6-10        11-15   16-20   21-31
vrlw        0001_00  vD          vA      vB      000_1000_0100
vrsqrtefp   0001_00  vD          0_0000  vB      001_0100_1010
vsel        0001_00  vD          vA      vB      vC 10_1010
vsl         0001_00  vD          vA      vB      001_1100_0100
vslb        0001_00  vD          vA      vB      001_0000_0100
vsldoi      0001_00  vD          vA      vB      0 SH 10_1100
vslh        0001_00  vD          vA      vB      001_0100_0100
vslo        0001_00  vD          vA      vB      100_0000_1100
vslw        0001_00  vD          vA      vB      001_1000_0100
vspltb      0001_00  vD          UIMM    vB      010_0000_1100
vsplth      0001_00  vD          UIMM    vB      010_0100_1100
vspltisb    0001_00  vD          SIMM    0000_0  011_0000_1100
vspltish    0001_00  vD          SIMM    0000_0  011_0100_1100
vspltisw    0001_00  vD          SIMM    0000_0  011_1000_1100
vspltw      0001_00  vD          UIMM    vB      010_1000_1100
vsr         0001_00  vD          vA      vB      010_1100_0100
vsrab       0001_00  vD          vA      vB      011_0000_0100
vsrah       0001_00  vD          vA      vB      011_0100_0100
vsraw       0001_00  vD          vA      vB      011_1000_0100
vsrb        0001_00  vD          vA      vB      010_0000_0100
vsrh        0001_00  vD          vA      vB      010_0100_0100
vsro        0001_00  vD          vA      vB      100_0100_1100
vsrw        0001_00  vD          vA      vB      010_1000_0100
vsubcuw     0001_00  vD          vA      vB      101_1000_0000
vsubfp      0001_00  vD          vA      vB      000_0100_1010
vsubsbs     0001_00  vD          vA      vB      111_0000_0000
vsubshs     0001_00  vD          vA      vB      111_0100_0000
vsubsws     0001_00  vD          vA      vB      111_1000_0000
vsububm     0001_00  vD          vA      vB      100_0000_0000
vsububs     0001_00  vD          vA      vB      110_0000_0000
vsubuhm     0001_00  vD          vA      vB      100_0100_0000
vsubuhs     0001_00  vD          vA      vB      110_0100_0000
vsubuwm     0001_00  vD          vA      vB      100_1000_0000
Table B-1. Instructions Sorted by Mnemonic in Binary Format (continued)

Name        0-5      6-10        11-15   16-20   21-31
vsubuws     0001_00  vD          vA      vB      110_1000_0000
vsumsws     0001_00  vD          vA      vB      111_1000_1000
vsum2sws    0001_00  vD          vA      vB      110_1000_1000
vsum4sbs    0001_00  vD          vA      vB      111_0000_1000
vsum4shs    0001_00  vD          vA      vB      110_0100_1000
vsum4ubs    0001_00  vD          vA      vB      110_0000_1000
vupkhpx     0001_00  vD          0_0000  vB      011_0100_1110
vupkhsb     0001_00  vD          0_0000  vB      010_0000_1110
vupkhsh     0001_00  vD          0_0000  vB      010_0100_1110
vupklpx     0001_00  vD          0_0000  vB      011_1100_1110
vupklsb     0001_00  vD          0_0000  vB      010_1000_1110
vupklsh     0001_00  vD          0_0000  vB      010_1100_1110
vxor        0001_00  vD          vA      vB      100_1100_0100
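The binary strings in Table B-1 are the decimal opcodes of Table A-1 written out bit by bit. A short Python sketch of the conversion follows. It assumes, from the pattern visible in the rows (0001_00 for the 6-bit primary opcode at bits 0-5, 100_1100_0100 for an 11-bit field at bits 21-31), that underscores fall on 4-bit boundaries of the whole 32-bit word rather than of the field. The function name is hypothetical.

```python
def grouped_field(value, start_bit, width):
    """Render a field as binary, with '_' wherever the absolute bit position
    inside the 32-bit instruction word is a multiple of 4."""
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the field")
    bits = format(value, "0{}b".format(width))
    pieces = []
    for offset, digit in enumerate(bits):
        if offset > 0 and (start_bit + offset) % 4 == 0:
            pieces.append("_")
        pieces.append(digit)
    return "".join(pieces)


print(grouped_field(4, 0, 6))        # primary opcode 04
print(grouped_field(1220, 21, 11))   # vxor, 11-bit extended opcode
print(grouped_field(43, 26, 6))      # vperm, 6-bit extended opcode
print(grouped_field(822, 21, 10))    # dss, 10-bit extended opcode (bit 31 is separate)
```

Regenerating a row this way is a quick cross-check between the decimal and binary tables.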
Appendix C
Instructions Sorted by Opcode

C.1 Instructions Sorted by Opcode in Decimal Format

Table C-1 lists AltiVec instructions grouped by opcode in decimal format.

Table C-1. Instructions Sorted by Opcode in Decimal Format

Name        0-5  6-10        11-15   16-20   21-31
vmhaddshs   04   vD          vA      vB      vC 32
vmhraddshs  04   vD          vA      vB      vC 33
vmladduhm   04   vD          vA      vB      vC 34
vmsumubm    04   vD          vA      vB      vC 36
vmsummbm    04   vD          vA      vB      vC 37
vmsumuhm    04   vD          vA      vB      vC 38
vmsumuhs    04   vD          vA      vB      vC 39
vmsumshm    04   vD          vA      vB      vC 40
vmsumshs    04   vD          vA      vB      vC 41
vsel        04   vD          vA      vB      vC 42
vperm       04   vD          vA      vB      vC 43
vsldoi      04   vD          vA      vB      0 SH 44
vmaddfp     04   vD          vA      vB      vC 46
vnmsubfp    04   vD          vA      vB      vC 47
vaddubm     04   vD          vA      vB      0
vadduhm     04   vD          vA      vB      64
vadduwm     04   vD          vA      vB      128
vaddcuw     04   vD          vA      vB      384
vaddubs     04   vD          vA      vB      512
Table C-1. Instructions Sorted by Opcode in Decimal Format (continued)

Name        0-5  6-10        11-15   16-20   21-31
vadduhs     04   vD          vA      vB      576
vadduws     04   vD          vA      vB      640
vaddsbs     04   vD          vA      vB      768
vaddshs     04   vD          vA      vB      832
vaddsws     04   vD          vA      vB      896
vsububm     04   vD          vA      vB      1024
vsubuhm     04   vD          vA      vB      1088
vsubuwm     04   vD          vA      vB      1152
vsubcuw     04   vD          vA      vB      1408
vsububs     04   vD          vA      vB      1536
vsubuhs     04   vD          vA      vB      1600
vsubuws     04   vD          vA      vB      1664
vsubsbs     04   vD          vA      vB      1792
vsubshs     04   vD          vA      vB      1856
vsubsws     04   vD          vA      vB      1920
vmaxub      04   vD          vA      vB      2
vmaxuh      04   vD          vA      vB      66
vmaxuw      04   vD          vA      vB      130
vmaxsb      04   vD          vA      vB      258
vmaxsh      04   vD          vA      vB      322
vmaxsw      04   vD          vA      vB      386
vminub      04   vD          vA      vB      514
vminuh      04   vD          vA      vB      578
vminuw      04   vD          vA      vB      642
vminsb      04   vD          vA      vB      770
vminsh      04   vD          vA      vB      834
vminsw      04   vD          vA      vB      898
vavgub      04   vD          vA      vB      1026
vavguh      04   vD          vA      vB      1090
vavguw      04   vD          vA      vB      1154
vavgsb      04   vD          vA      vB      1282
vavgsh      04   vD          vA      vB      1346
vavgsw      04   vD          vA      vB      1410
Table C-1. Instructions Sorted by Opcode in Decimal Format (continued)

Name        0-5  6-10        11-15   16-20   21-31
vrlb        04   vD          vA      vB      4
vrlh        04   vD          vA      vB      68
vrlw        04   vD          vA      vB      132
vslb        04   vD          vA      vB      260
vslh        04   vD          vA      vB      324
vslw        04   vD          vA      vB      388
vsl         04   vD          vA      vB      452
vsrb        04   vD          vA      vB      516
vsrh        04   vD          vA      vB      580
vsrw        04   vD          vA      vB      644
vsr         04   vD          vA      vB      708
vsrab       04   vD          vA      vB      772
vsrah       04   vD          vA      vB      836
vsraw       04   vD          vA      vB      900
vand        04   vD          vA      vB      1028
vandc       04   vD          vA      vB      1092
vor         04   vD          vA      vB      1156
vxor        04   vD          vA      vB      1220
vnor        04   vD          vA      vB      1284
mfvscr      04   vD          0_0000  0000_0  1540
mtvscr      04   00_000      0_0000  vB      1604
vcmpequbx   04   vD          vA      vB      Rc 6
vcmpequhx   04   vD          vA      vB      Rc 70
vcmpequwx   04   vD          vA      vB      Rc 134
vcmpeqfpx   04   vD          vA      vB      Rc 198
vcmpgefpx   04   vD          vA      vB      Rc 454
vcmpgtubx   04   vD          vA      vB      Rc 518
vcmpgtuhx   04   vD          vA      vB      Rc 582
vcmpgtuwx   04   vD          vA      vB      Rc 646
vcmpgtfpx   04   vD          vA      vB      Rc 710
vcmpgtsbx   04   vD          vA      vB      Rc 774
vcmpgtshx   04   vD          vA      vB      Rc 838
vcmpgtswx   04   vD          vA      vB      Rc 902
Table C-1. Instructions Sorted by Opcode in Decimal Format (continued)

Name        0-5  6-10        11-15   16-20   21-31
vcmpbfpx    04   vD          vA      vB      Rc 966
vmuloub     04   vD          vA      vB      8
vmulouh     04   vD          vA      vB      72
vmulosb     04   vD          vA      vB      264
vmulosh     04   vD          vA      vB      328
vmuleub     04   vD          vA      vB      520
vmuleuh     04   vD          vA      vB      584
vmulesb     04   vD          vA      vB      776
vmulesh     04   vD          vA      vB      840
vsum4ubs    04   vD          vA      vB      1544
vsum4sbs    04   vD          vA      vB      1800
vsum4shs    04   vD          vA      vB      1608
vsum2sws    04   vD          vA      vB      1672
vsumsws     04   vD          vA      vB      1928
vaddfp      04   vD          vA      vB      10
vsubfp      04   vD          vA      vB      74
vrefp       04   vD          0_0000  vB      266
vrsqrtefp   04   vD          0_0000  vB      330
vexptefp    04   vD          0_0000  vB      394
vlogefp     04   vD          0_0000  vB      458
vrfin       04   vD          0_0000  vB      522
vrfiz       04   vD          0_0000  vB      586
vrfip       04   vD          0_0000  vB      650
vrfim       04   vD          0_0000  vB      714
vcfux       04   vD          UIMM    vB      778
vcfsx       04   vD          UIMM    vB      842
vctuxs      04   vD          UIMM    vB      906
vctsxs      04   vD          UIMM    vB      970
vmaxfp      04   vD          vA      vB      1034
vminfp      04   vD          vA      vB      1098
vmrghb      04   vD          vA      vB      12
vmrghh      04   vD          vA      vB      76
vmrghw      04   vD          vA      vB      140
Table C-1. Instructions Sorted by Opcode in Decimal Format (continued)

Name        0-5  6-10        11-15   16-20   21-31
vmrglb      04   vD          vA      vB      268
vmrglh      04   vD          vA      vB      332
vmrglw      04   vD          vA      vB      396
vspltb      04   vD          UIMM    vB      524
vsplth      04   vD          UIMM    vB      588
vspltw      04   vD          UIMM    vB      652
vspltisb    04   vD          SIMM    0000_0  780
vspltish    04   vD          SIMM    0000_0  844
vspltisw    04   vD          SIMM    0000_0  908
vslo        04   vD          vA      vB      1036
vsro        04   vD          vA      vB      1100
vpkuhum     04   vD          vA      vB      14
vpkuwum     04   vD          vA      vB      78
vpkuhus     04   vD          vA      vB      142
vpkuwus     04   vD          vA      vB      206
vpkshus     04   vD          vA      vB      270
vpkswus     04   vD          vA      vB      334
vpkshss     04   vD          vA      vB      398
vpkswss     04   vD          vA      vB      462
vupkhsb     04   vD          0_0000  vB      526
vupkhsh     04   vD          0_0000  vB      590
vupklsb     04   vD          0_0000  vB      654
vupklsh     04   vD          0_0000  vB      718
vpkpx       04   vD          vA      vB      782
vupkhpx     04   vD          0_0000  vB      846
vupklpx     04   vD          0_0000  vB      974
lvsl        31   vD          A       B       6 0
lvsr        31   vD          A       B       38 0
dst         31   0 0_0 STRM  A       B       342 0
dstt        31   1 0_0 STRM  A       B       342 0
dstst       31   0 0_0 STRM  A       B       374 0
dststt      31   1 0_0 STRM  A       B       374 0
dss         31   0 0_0 STRM  0_0000  0000_0  822 0
Table C-1. Instructions Sorted by Opcode in Decimal Format (continued)

Name        0-5  6-10        11-15   16-20   21-31
dssall      31   1 0_0 STRM  0_0000  0000_0  822 0
lvebx       31   vD          A       B       7 0
lvehx       31   vD          A       B       39 0
lvewx       31   vD          A       B       71 0
lvx         31   vD          A       B       103 0
lvxl        31   vD          A       B       359 0
stvebx      31   vS          A       B       135 0
stvehx      31   vS          A       B       167 0
stvewx      31   vS          A       B       199 0
stvx        31   vS          A       B       231 0
stvxl       31   vS          A       B       487 0
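Table C-1 groups the primary opcode 04 instructions by the low-order bits of the word, which suggests one way to work backward from an instruction word to a table row. The Python sketch below is an illustration under stated assumptions. The three-way split is inferred from the rows listed (every row with a vC or SH field has a 6-bit extended opcode of 32 or more, and every compare row ends in binary 00_0110). The labels VA, VXR, and VX are used here only as tags for those three layouts. The dictionaries hold a handful of rows copied from the table, not the full instruction set.

```python
# A few rows from Table C-1, keyed by extended opcode (decimal).
VA_ROWS = {32: "vmhaddshs", 42: "vsel", 43: "vperm", 46: "vmaddfp", 47: "vnmsubfp"}
VXR_ROWS = {6: "vcmpequb", 70: "vcmpequh", 134: "vcmpequw", 966: "vcmpbfp"}
VX_ROWS = {0: "vaddubm", 64: "vadduhm", 846: "vupkhpx", 1028: "vand", 1220: "vxor"}


def decode_opcode4(word):
    """Return (layout, mnemonic or None) for a primary opcode 04 word."""
    if (word >> 26) & 0x3F != 4:
        raise ValueError("not a primary opcode 04 instruction")
    low6 = word & 0x3F
    if low6 >= 32:                        # 6-bit extended opcode, vC in bits 21-25
        return "VA", VA_ROWS.get(low6)
    if low6 == 6:                         # compare: Rc in bit 21, 10-bit opcode
        name = VXR_ROWS.get(word & 0x3FF)
        rc = (word >> 10) & 0x1
        return "VXR", (name + "." if name and rc else name)
    return "VX", VX_ROWS.get(word & 0x7FF)   # 11-bit extended opcode


print(decode_opcode4(0x100004C4))   # word built from opcode 04 and extended opcode 1220
```

A complete decoder would load every row of the table and also handle the primary opcode 31 instructions, whose 10-bit extended opcode sits in bits 21-30.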
Appendix D
Instructions Sorted by Opcode

D.1 Instructions Sorted by Opcode in Binary Format

Table D-1 lists AltiVec instructions grouped by opcode in binary format.

Table D-1. Instructions Sorted by Opcode in Binary Format

Name        0-5      6-10        11-15   16-20   21-31
vmhaddshs   0001_00  vD          vA      vB      vC 10_0000
vmhraddshs  0001_00  vD          vA      vB      vC 10_0001
vmladduhm   0001_00  vD          vA      vB      vC 10_0010
vmsumubm    0001_00  vD          vA      vB      vC 10_0100
vmsummbm    0001_00  vD          vA      vB      vC 10_0101
vmsumuhm    0001_00  vD          vA      vB      vC 10_0110
vmsumuhs    0001_00  vD          vA      vB      vC 10_0111
vmsumshm    0001_00  vD          vA      vB      vC 10_1000
vmsumshs    0001_00  vD          vA      vB      vC 10_1001
vsel        0001_00  vD          vA      vB      vC 10_1010
vperm       0001_00  vD          vA      vB      vC 10_1011
vsldoi      0001_00  vD          vA      vB      0 SH 10_1100
vmaddfp     0001_00  vD          vA      vB      vC 10_1110
vnmsubfp    0001_00  vD          vA      vB      vC 10_1111
vaddubm     0001_00  vD          vA      vB      000_0000_0000
vadduhm     0001_00  vD          vA      vB      000_0100_0000
vadduwm     0001_00  vD          vA      vB      000_1000_0000
vaddcuw     0001_00  vD          vA      vB      001_1000_0000
vaddubs     0001_00  vD          vA      vB      010_0000_0000
vadduhs     0001_00  vD          vA      vB      010_0100_0000
Table D-1. Instructions Sorted by Opcode in Binary Format (continued)

Name        0-5      6-10        11-15   16-20   21-31
vadduws     0001_00  vD          vA      vB      010_1000_0000
vaddsbs     0001_00  vD          vA      vB      011_0000_0000
vaddshs     0001_00  vD          vA      vB      011_0100_0000
vaddsws     0001_00  vD          vA      vB      011_1000_0000
vsububm     0001_00  vD          vA      vB      100_0000_0000
vsubuhm     0001_00  vD          vA      vB      100_0100_0000
vsubuwm     0001_00  vD          vA      vB      100_1000_0000
vsubcuw     0001_00  vD          vA      vB      101_1000_0000
vsububs     0001_00  vD          vA      vB      110_0000_0000
vsubuhs     0001_00  vD          vA      vB      110_0100_0000
vsubuws     0001_00  vD          vA      vB      110_1000_0000
vsubsbs     0001_00  vD          vA      vB      111_0000_0000
vsubshs     0001_00  vD          vA      vB      111_0100_0000
vsubsws     0001_00  vD          vA      vB      111_1000_0000
vmaxub      0001_00  vD          vA      vB      000_0000_0010
vmaxuh      0001_00  vD          vA      vB      000_0100_0010
vmaxuw      0001_00  vD          vA      vB      000_1000_0010
vmaxsb      0001_00  vD          vA      vB      001_0000_0010
vmaxsh      0001_00  vD          vA      vB      001_0100_0010
vmaxsw      0001_00  vD          vA      vB      001_1000_0010
vminub      0001_00  vD          vA      vB      010_0000_0010
vminuh      0001_00  vD          vA      vB      010_0100_0010
vminuw      0001_00  vD          vA      vB      010_1000_0010
vminsb      0001_00  vD          vA      vB      011_0000_0010
vminsh      0001_00  vD          vA      vB      011_0100_0010
vminsw      0001_00  vD          vA      vB      011_1000_0010
vavgub      0001_00  vD          vA      vB      100_0000_0010
vavguh      0001_00  vD          vA      vB      100_0100_0010
vavguw      0001_00  vD          vA      vB      100_1000_0010
vavgsb      0001_00  vD          vA      vB      101_0000_0010
vavgsh      0001_00  vD          vA      vB      101_0100_0010
vavgsw      0001_00  vD          vA      vB      101_1000_0010
vrlb        0001_00  vD          vA      vB      000_0000_0100
vrlh       0001_00  vD          vA      vB      000_0100_0100
vrlw       0001_00  vD          vA      vB      000_1000_0100
vslb       0001_00  vD          vA      vB      001_0000_0100
vslh       0001_00  vD          vA      vB      001_0100_0100
vslw       0001_00  vD          vA      vB      001_1000_0100
vsl        0001_00  vD          vA      vB      001_1100_0100
vsrb       0001_00  vD          vA      vB      010_0000_0100
vsrh       0001_00  vD          vA      vB      010_0100_0100
vsrw       0001_00  vD          vA      vB      010_1000_0100
vsr        0001_00  vD          vA      vB      010_1100_0100
vsrab      0001_00  vD          vA      vB      011_0000_0100
vsrah      0001_00  vD          vA      vB      011_0100_0100
vsraw      0001_00  vD          vA      vB      011_1000_0100
vand       0001_00  vD          vA      vB      100_0000_0100
vandc      0001_00  vD          vA      vB      100_0100_0100
vor        0001_00  vD          vA      vB      100_1000_0100
vxor       0001_00  vD          vA      vB      100_1100_0100
vnor       0001_00  vD          vA      vB      101_0000_0100
mfvscr     0001_00  vD          0_0000  0000_0  110_0000_0100
mtvscr     0001_00  00_000      0_0000  vB      110_0100_0100
vcmpequbx  0001_00  vD          vA      vB      Rc 00_0000_0110
vcmpequhx  0001_00  vD          vA      vB      Rc 00_0100_0110
vcmpequwx  0001_00  vD          vA      vB      Rc 00_1000_0110
vcmpeqfpx  0001_00  vD          vA      vB      Rc 00_1100_0110
vcmpgefpx  0001_00  vD          vA      vB      Rc 01_1100_0110
vcmpgtubx  0001_00  vD          vA      vB      Rc 10_0000_0110
vcmpgtuhx  0001_00  vD          vA      vB      Rc 10_0100_0110
vcmpgtuwx  0001_00  vD          vA      vB      Rc 10_1000_0110
vcmpgtfpx  0001_00  vD          vA      vB      Rc 10_1100_0110
vcmpgtsbx  0001_00  vD          vA      vB      Rc 11_0000_0110
vcmpgtshx  0001_00  vD          vA      vB      Rc 11_0100_0110
vcmpgtswx  0001_00  vD          vA      vB      Rc 11_1000_0110
vcmpbfpx   0001_00  vD          vA      vB      Rc 11_1100_0110
vmuloub    0001_00  vD          vA      vB      000_0000_1000
vmulouh    0001_00  vD          vA      vB      000_0100_1000
vmulosb    0001_00  vD          vA      vB      001_0000_1000
vmulosh    0001_00  vD          vA      vB      001_0100_1000
vmuleub    0001_00  vD          vA      vB      010_0000_1000
vmuleuh    0001_00  vD          vA      vB      010_0100_1000
vmulesb    0001_00  vD          vA      vB      011_0000_1000
vmulesh    0001_00  vD          vA      vB      011_0100_1000
vsum4ubs   0001_00  vD          vA      vB      110_0000_1000
vsum4sbs   0001_00  vD          vA      vB      111_0000_1000
vsum4shs   0001_00  vD          vA      vB      110_0100_1000
vsum2sws   0001_00  vD          vA      vB      110_1000_1000
vsumsws    0001_00  vD          vA      vB      111_1000_1000
vaddfp     0001_00  vD          vA      vB      000_0000_1010
vsubfp     0001_00  vD          vA      vB      000_0100_1010
vrefp      0001_00  vD          0_0000  vB      001_0000_1010
vrsqrtefp  0001_00  vD          0_0000  vB      001_0100_1010
vexptefp   0001_00  vD          0_0000  vB      001_1000_1010
vlogefp    0001_00  vD          0_0000  vB      001_1100_1010
vrfin      0001_00  vD          0_0000  vB      010_0000_1010
vrfiz      0001_00  vD          0_0000  vB      010_0100_1010
vrfip      0001_00  vD          0_0000  vB      010_1000_1010
vrfim      0001_00  vD          0_0000  vB      010_1100_1010
vcfux      0001_00  vD          UIMM    vB      011_0000_1010
vcfsx      0001_00  vD          UIMM    vB      011_0100_1010
vctuxs     0001_00  vD          UIMM    vB      011_1000_1010
vctsxs     0001_00  vD          UIMM    vB      011_1100_1010
vmaxfp     0001_00  vD          vA      vB      100_0000_1010
vminfp     0001_00  vD          vA      vB      100_0100_1010
vmrghb     0001_00  vD          vA      vB      000_0000_1100
vmrghh     0001_00  vD          vA      vB      000_0100_1100
vmrghw     0001_00  vD          vA      vB      000_1000_1100
vmrglb     0001_00  vD          vA      vB      001_0000_1100
vmrglh     0001_00  vD          vA      vB      001_0100_1100
vmrglw     0001_00  vD          vA      vB      001_1000_1100
vspltb     0001_00  vD          UIMM    vB      010_0000_1100
vsplth     0001_00  vD          UIMM    vB      010_0100_1100
vspltw     0001_00  vD          UIMM    vB      010_1000_1100
vspltisb   0001_00  vD          SIMM    0000_0  011_0000_1100
vspltish   0001_00  vD          SIMM    0000_0  011_0100_1100
vspltisw   0001_00  vD          SIMM    0000_0  011_1000_1100
vslo       0001_00  vD          vA      vB      100_0000_1100
vsro       0001_00  vD          vA      vB      100_0100_1100
vpkuhum    0001_00  vD          vA      vB      000_0000_1110
vpkuwum    0001_00  vD          vA      vB      000_0100_1110
vpkuhus    0001_00  vD          vA      vB      000_1000_1110
vpkuwus    0001_00  vD          vA      vB      000_1100_1110
vpkshus    0001_00  vD          vA      vB      001_0000_1110
vpkswus    0001_00  vD          vA      vB      001_0100_1110
vpkshss    0001_00  vD          vA      vB      001_1000_1110
vpkswss    0001_00  vD          vA      vB      001_1100_1110
vupkhsb    0001_00  vD          0_0000  vB      010_0000_1110
vupkhsh    0001_00  vD          0_0000  vB      010_0100_1110
vupklsb    0001_00  vD          0_0000  vB      010_1000_1110
vupklsh    0001_00  vD          0_0000  vB      010_1100_1110
vpkpx      0001_00  vD          vA      vB      011_0000_1110
vupkhpx    0001_00  vD          0_0000  vB      011_0100_1110
vupklpx    0001_00  vD          0_0000  vB      011_1100_1110
lvsl       0111_11  vD          A       B       000_0000_110 0
lvsr       0111_11  vD          A       B       000_0100_110 0
dst        0111_11  0 0_0 STRM  A       B       010_1010_110 0
dstt       0111_11  1 0_0 STRM  A       B       010_1010_110 0
dstst      0111_11  0 0_0 STRM  A       B       010_1110_110 0
dststt     0111_11  1 0_0 STRM  A       B       010_1110_110 0
dss        0111_11  0 0_0 STRM  0_0000  0000_0  110_0110_110 0
dssall     0111_11  1 0_0 STRM  0_0000  0000_0  110_0110_110 0
lvebx      0111_11  vD          A       B       000_0000_111 0
lvehx      0111_11  vD          A       B       000_0100_111 0
lvewx      0111_11  vD          A       B       000_1000_111 0
lvx        0111_11  vD          A       B       000_1100_111 0
lvxl       0111_11  vD          A       B       010_1100_111 0
stvebx     0111_11  vS          A       B       001_0000_111 0
stvehx     0111_11  vS          A       B       001_0100_111 0
stvewx     0111_11  vS          A       B       001_1000_111 0
stvx       0111_11  vS          A       B       001_1100_111 0
stvxl      0111_11  vS          A       B       011_1100_111 0
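The binary patterns in Table D-1 can be turned into instruction words mechanically. The sketch below is only an illustration, not part of the manual: the helper names `xo_from_binary` and `encode_vx` are hypothetical, and it assumes the usual PowerPC convention that bit 0 is the most-significant bit of the 32-bit word, so the field in bits 6-10 is shifted left by 21, and so on.

```python
def xo_from_binary(pattern: str) -> int:
    """Convert a Table D-1 style bit pattern such as '000_0100_0000' to an integer."""
    return int(pattern.replace("_", ""), 2)


def encode_vx(xo: int, vd: int, va: int, vb: int, opcd: int = 4) -> int:
    """Assemble a VX-form word: OPCD (0-5), vD (6-10), vA (11-15), vB (16-20), XO (21-31).

    Hypothetical helper for illustration; register numbers must be 0-31.
    """
    for reg in (vd, va, vb):
        if not 0 <= reg <= 31:
            raise ValueError("register number out of range")
    if not 0 <= xo < (1 << 11):
        raise ValueError("VX-form XO is an 11-bit field")
    return (opcd << 26) | (vd << 21) | (va << 16) | (vb << 11) | xo


# vadduhm is listed with primary opcode 0001_00 and extended opcode 000_0100_0000.
opcd = xo_from_binary("0001_00")
xo = xo_from_binary("000_0100_0000")
word = encode_vx(xo, vd=1, va=2, vb=3, opcd=opcd)
print(f"vadduhm v1,v2,v3 -> 0x{word:08X}")
```

The same two integers (4 and 64) are what the form-sorted tables in Appendix E list in decimal for this instruction.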
Appendix E
Instructions Sorted by Form

E.1 Instructions Sorted by Form

Table E-1 through Table E-4 list the AltiVec instructions grouped by form.

Table E-1. VA-Form

Formats:
  OPCD  vD  vA  vB  vC  XO
  OPCD  vD  vA  vB  0 SH  XO

Key: operand fields shown as all zeros are reserved bits.

Specific instructions

Name       0-5 6-10        11-15   16-20   21-25   26-31
vmhaddshs  04  vD          vA      vB      vC      32
vmhraddshs 04  vD          vA      vB      vC      33
vmladduhm  04  vD          vA      vB      vC      34
vmsumubm   04  vD          vA      vB      vC      36
vmsummbm   04  vD          vA      vB      vC      37
vmsumuhm   04  vD          vA      vB      vC      38
vmsumuhs   04  vD          vA      vB      vC      39
vmsumshm   04  vD          vA      vB      vC      40
vmsumshs   04  vD          vA      vB      vC      41
vsel       04  vD          vA      vB      vC      42
vperm      04  vD          vA      vB      vC      43
vsldoi     04  vD          vA      vB      0 SH    44
vmaddfp    04  vD          vA      vB      vC      46
vnmsubfp   04  vD          vA      vB      vC      47
Table E-2. VX-Form

Formats:
  OPCD  vD      vA      vB      XO
  OPCD  vD      0_0000  0000_0  XO  0
  OPCD  00_000  0_0000  vB      XO  0
  OPCD  vD      0_0000  vB      XO
  OPCD  vD      UIMM    vB      XO
  OPCD  vD      SIMM    0000_0  XO

Specific instructions

Name       0-5 6-10        11-15   16-20   21-31
vaddubm    04  vD          vA      vB      0
vadduhm    04  vD          vA      vB      64
vadduwm    04  vD          vA      vB      128
vaddcuw    04  vD          vA      vB      384
vaddubs    04  vD          vA      vB      512
vadduhs    04  vD          vA      vB      576
vadduws    04  vD          vA      vB      640
vaddsbs    04  vD          vA      vB      768
vaddshs    04  vD          vA      vB      832
vaddsws    04  vD          vA      vB      896
vsububm    04  vD          vA      vB      1024
vsubuhm    04  vD          vA      vB      1088
vsubuwm    04  vD          vA      vB      1152
vsubcuw    04  vD          vA      vB      1408
vsububs    04  vD          vA      vB      1536
vsubuhs    04  vD          vA      vB      1600
vsubuws    04  vD          vA      vB      1664
vsubsbs    04  vD          vA      vB      1792
vsubshs    04  vD          vA      vB      1856
vsubsws    04  vD          vA      vB      1920
vmaxub     04  vD          vA      vB      2
vmaxuh     04  vD          vA      vB      66
vmaxuw     04  vD          vA      vB      130
vmaxsb     04  vD          vA      vB      258
vmaxsh     04  vD          vA      vB      322
vmaxsw     04  vD          vA      vB      386
vminub     04  vD          vA      vB      514
vminuh     04  vD          vA      vB      578
vminuw     04  vD          vA      vB      642
vminsb     04  vD          vA      vB      770
vminsh     04  vD          vA      vB      834
vminsw     04  vD          vA      vB      898
vavgub     04  vD          vA      vB      1026
vavguh     04  vD          vA      vB      1090
vavguw     04  vD          vA      vB      1154
vavgsb     04  vD          vA      vB      1282
vavgsh     04  vD          vA      vB      1346
vavgsw     04  vD          vA      vB      1410
vrlb       04  vD          vA      vB      4
vrlh       04  vD          vA      vB      68
vrlw       04  vD          vA      vB      132
vslb       04  vD          vA      vB      260
vslh       04  vD          vA      vB      324
vslw       04  vD          vA      vB      388
vsl        04  vD          vA      vB      452
vsrb       04  vD          vA      vB      516
vsrh       04  vD          vA      vB      580
vsrw       04  vD          vA      vB      644
vsr        04  vD          vA      vB      708
vsrab      04  vD          vA      vB      772
vsrah      04  vD          vA      vB      836
vsraw      04  vD          vA      vB      900
vand       04  vD          vA      vB      1028
vandc      04  vD          vA      vB      1092
vor        04  vD          vA      vB      1156
vnor       04  vD          vA      vB      1284
mfvscr     04  vD          0_0000  0000_0  1540
mtvscr     04  00_000      0_0000  vB      1604
vmuloub    04  vD          vA      vB      8
vmulouh    04  vD          vA      vB      72
vmulosb    04  vD          vA      vB      264
vmulosh    04  vD          vA      vB      328
vmuleub    04  vD          vA      vB      520
vmuleuh    04  vD          vA      vB      584
vmulesb    04  vD          vA      vB      776
vmulesh    04  vD          vA      vB      840
vsum4ubs   04  vD          vA      vB      1544
vsum4sbs   04  vD          vA      vB      1800
vsum4shs   04  vD          vA      vB      1608
vsum2sws   04  vD          vA      vB      1672
vsumsws    04  vD          vA      vB      1928
vaddfp     04  vD          vA      vB      10
vsubfp     04  vD          vA      vB      74
vrefp      04  vD          0_0000  vB      266
vrsqrtefp  04  vD          0_0000  vB      330
vexptefp   04  vD          0_0000  vB      394
vlogefp    04  vD          0_0000  vB      458
vrfin      04  vD          0_0000  vB      522
vrfiz      04  vD          0_0000  vB      586
vrfip      04  vD          0_0000  vB      650
vrfim      04  vD          0_0000  vB      714
vcfux      04  vD          UIMM    vB      778
vcfsx      04  vD          UIMM    vB      842
vctuxs     04  vD          UIMM    vB      906
vctsxs     04  vD          UIMM    vB      970
vmaxfp     04  vD          vA      vB      1034
vminfp     04  vD          vA      vB      1098
vmrghb     04  vD          vA      vB      12
vmrghh     04  vD          vA      vB      76
vmrghw     04  vD          vA      vB      140
vmrglb     04  vD          vA      vB      268
vmrglh     04  vD          vA      vB      332
vmrglw     04  vD          vA      vB      396
vspltb     04  vD          UIMM    vB      524
vsplth     04  vD          UIMM    vB      588
vspltw     04  vD          UIMM    vB      652
vspltisb   04  vD          SIMM    0000_0  780
vspltish   04  vD          SIMM    0000_0  844
vspltisw   04  vD          SIMM    0000_0  908
vslo       04  vD          vA      vB      1036
vsro       04  vD          vA      vB      1100
vpkuhum    04  vD          vA      vB      14
vpkuwum    04  vD          vA      vB      78
vpkuhus    04  vD          vA      vB      142
vpkuwus    04  vD          vA      vB      206
vpkshus    04  vD          vA      vB      270
vpkswus    04  vD          vA      vB      334
vpkshss    04  vD          vA      vB      398
vpkswss    04  vD          vA      vB      462
vupkhsb    04  vD          0_0000  vB      526
vupkhsh    04  vD          0_0000  vB      590
vupklsb    04  vD          0_0000  vB      654
vupklsh    04  vD          0_0000  vB      718
vpkpx      04  vD          vA      vB      782
vupkhpx    04  vD          0_0000  vB      846
vupklpx    04  vD          0_0000  vB      974
vxor       04  vD          vA      vB      1220

Table E-3. X-Form

Formats:
  OPCD  vD          vA  vB  XO  0
  OPCD  vS          vA  vB  XO  0
  OPCD  T 0_0 STRM  A   B   XO  0

Specific instructions
Name       0-5 6-10        11-15   16-20   21-30   31
dst        31  0 0_0 STRM  A       B       342     0
dstt       31  1 0_0 STRM  A       B       342     0
dstst      31  0 0_0 STRM  A       B       374     0
dststt     31  1 0_0 STRM  A       B       374     0
dss        31  0 0_0 STRM  0_0000  0000_0  822     0
dssall     31  1 0_0 STRM  0_0000  0000_0  822     0
lvebx      31  vD          A       B       7       0
lvehx      31  vD          A       B       39      0
lvewx      31  vD          A       B       71      0
lvsl       31  vD          A       B       6       0
lvsr       31  vD          A       B       38      0
lvx        31  vD          A       B       103     0
lvxl       31  vD          A       B       359     0
stvebx     31  vS          A       B       135     0
stvehx     31  vS          A       B       167     0
stvewx     31  vS          A       B       199     0
stvx       31  vS          A       B       231     0
stvxl      31  vS          A       B       487     0

Table E-4. VXR-Form

Format:
  OPCD  vD  vA  vB  Rc  XO

Specific instructions

Name       0-5 6-10        11-15   16-20   21  22-31
vcmpbfpx   04  vD          vA      vB      Rc  966
vcmpeqfpx  04  vD          vA      vB      Rc  198
vcmpequbx  04  vD          vA      vB      Rc  6
vcmpequhx  04  vD          vA      vB      Rc  70
vcmpequwx  04  vD          vA      vB      Rc  134
vcmpgefpx  04  vD          vA      vB      Rc  454
vcmpgtfpx  04  vD          vA      vB      Rc  710
vcmpgtsbx  04  vD          vA      vB      Rc  774
vcmpgtshx  04  vD          vA      vB      Rc  838
vcmpgtswx  04  vD          vA      vB      Rc  902
vcmpgtubx  04  vD          vA      vB      Rc  518
vcmpgtuhx  04  vD          vA      vB      Rc  582
vcmpgtuwx  04  vD          vA      vB      Rc  646
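The four forms differ mainly in where the extended opcode (XO) sits. The following sketch shows one way to pull the XO out of an instruction word using only the values listed in Tables E-1 through E-4. It is an illustration under stated assumptions, not a decoder from the manual: the names `field`, `VXR_XOS`, and `split_fields` are hypothetical, bit 0 is taken as the most-significant bit, and the VA test relies on the observation that every VA-form XO in Table E-1 (32-47) has bit 26 set while every VX-form and VXR-form XO listed has bit 26 clear.

```python
def field(word: int, first: int, last: int) -> int:
    """Extract bits first..last of a 32-bit word (PowerPC numbering, bit 0 = MSB)."""
    width = last - first + 1
    return (word >> (31 - last)) & ((1 << width) - 1)


# 10-bit extended opcodes listed in Table E-4 (VXR-form).
VXR_XOS = {6, 70, 134, 198, 454, 518, 582, 646, 710, 774, 838, 902, 966}


def split_fields(word: int) -> dict:
    """Classify a word by form and return its extended opcode (illustrative only)."""
    opcd = field(word, 0, 5)
    if opcd == 31:
        # X-form: 10-bit XO in bits 21-30, bit 31 is 0.
        return {"form": "X", "xo": field(word, 21, 30)}
    if opcd != 4:
        raise ValueError("not an AltiVec primary opcode (04 or 31)")
    if field(word, 26, 26):
        # VA-form: 6-bit XO in bits 26-31, fourth operand in bits 21-25.
        return {"form": "VA", "xo": field(word, 26, 31)}
    if field(word, 22, 31) in VXR_XOS:
        # VXR-form: Rc in bit 21, 10-bit XO in bits 22-31.
        return {"form": "VXR", "rc": field(word, 21, 21), "xo": field(word, 22, 31)}
    # VX-form: 11-bit XO in bits 21-31.
    return {"form": "VX", "xo": field(word, 21, 31)}


print(split_fields(0x1022192E))  # VA-form word carrying XO 46 (vmaddfp)
print(split_fields(0x10000406))  # VXR-form word carrying XO 6 with Rc set
```

Primary opcode 31 is shared with many non-AltiVec PowerPC instructions, so a real decoder would also need to check the XO against the list in Table E-3.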
Appendix F
Instruction Set Legend

F.1 Instruction Set Legend

Table F-1 provides general information on the AltiVec instruction set such as the architectural level, privilege level, and form.

Table F-1. AltiVec Instruction Set Legend

Columns: UISA, VEA, OEA, Supervisor Level, Optional, Form

Name       Form
dss        VX
dssall     VX
dst        VX
dstst      VX
dststt     VX
dstt       VX
lvebx      X
lvehx      X
lvewx      X
lvsl       X
lvsr       X
lvx        X
lvxl       X
mfvscr     VX
mtvscr     VX
stvebx     X
stvehx     X
stvewx     X
stvx       X
stvxl      X
vaddcuw    VX
vaddfp     VX
vaddsbs    VX
vaddshs    VX
vaddsws    VX
vaddubm    VX
vaddubs    VX
vadduhm    VX
vadduhs    VX
vadduwm    VX
vadduws    VX
vand       VX
vandc      VX
vavgsb     VX
vavgsh     VX
vavgsw     VX
vavgub     VX
vavguh     VX
vavguw     VX
vcfux      VX
vcfsx      VX
vcmpbfpx   VXR
vcmpeqfpx  VXR
vcmpequbx  VXR
vcmpequhx  VXR
vcmpequwx  VXR
vcmpgefpx  VXR
vcmpgtfpx  VXR
vcmpgtsbx  VXR
vcmpgtshx  VXR
vcmpgtswx  VXR
vcmpgtubx  VXR
vcmpgtuhx  VXR
vcmpgtuwx  VXR
vctsxs     VX
vctuxs     VX
vexptefp   VX
vlogefp    VX
vmaddfp    VA
vmaxfp     VX
vmaxsb     VX
vmaxsh     VX
vmaxsw     VX
vmaxub     VX
vmaxuh     VX
vmaxuw     VX
vmhaddshs  VA
vmhraddshs VA
vminfp     VX
vminsb     VX
vminsh     VX
vminsw     VX
vminub     VX
vminuh     VX
vminuw     VX
vmladduhm  VA
vmrghb     VX
vmrghh     VX
vmrghw     VX
vmrglb     VX
vmrglh     VX
vmrglw     VX
vmsummbm   VA
vmsumshm   VA
vmsumshs   VA
vmsumubm   VA
vmsumuhm   VA
vmsumuhs   VA
vmulesb    VX
vmulesh    VX
vmuleub    VX
vmuleuh    VX
vmulosb    VX
vmulosh    VX
vmuloub    VX
vmulouh    VX
vnmsubfp   VA
vnor       VX
vor        VX
vperm      VA
vpkpx      VX
vpkshss    VX
vpkshus    VX
vpkswss    VX
vpkuhum    VX
vpkuhus    VX
vpkswus    VX
vpkuwum    VX
vpkuwus    VX
vrefp      VX
vrfim      VX
vrfin      VX
vrfip      VX
vrfiz      VX
vrlb       VX
vrlh       VX
vrlw       VX
vrsqrtefp  VX
vsel       VA
vsl        VX
vslb       VX
vsldoi     VA
vslh       VX
vslo       VX
vslw       VX
vspltb     VX
vsplth     VX
vspltisb   VX
vspltish   VX
vspltisw   VX
vspltw     VX
vsr        VX
vsrab      VX
vsrah      VX
vsraw      VX
vsrb       VX
vsrh       VX
vsro       VX
vsrw       VX
vsubcuw    VX
vsubfp     VX
vsubsbs    VX
vsubshs    VX
vsubsws    VX
vsububm    VX
vsubuhm    VX
vsububs    VX
vsubuhs    VX
vsubuwm    VX
vsubuws    VX
vsumsws    VX
vsum2sws   VX
vsum4sbs   VX
vsum4shs   VX
vsum4ubs   VX
vupkhpx    VX
vupkhsb    VX
vupkhsh    VX
vupkhpx    VX
vupklsh    VX
vupklpx    VX
vupklsb    VX
vupklsh    VX
vxor       VX
Appendix G
User's Manual Revision History

This appendix lists the major differences between Revision 0 and Revision 1 of the AltiVec Programming Environments Manual. The list covers only the major changes to the user's manual. Revision 2 consists only of minor formatting upgrades; no major changes were made to Revision 1.

The major changes to the AltiVec Programming Environments Manual, Revision 0, are as follows:

Section, Page      Change

2.1.2, page 2-4    Replace Figure 2-4, "Saving/Restoring the AltiVec Context Register (VRSAVE)," with the following:

                   Bits 0-15    Field: VR0-VR15 (one bit per register)
                                Reset: 0000_0000_0000_0000
                                R/W:   R/W using mfspr or mtspr instructions
                   Bits 16-31   Field: VR16-VR31 (one bit per register)
                                Reset: 0000_0000_0000_0000
                                R/W:   R/W using mfspr or mtspr instructions
                   SPR          SPR256

2.2, page 2-9      Figure 2-10: The vector registers are 128 bits wide, not 64 bits wide as shown.
4.2.2.4, page 4-20 Change Table 4-9 as follows:
                   The mnemonic for Vector Round to Floating-Point Integer Nearest should be vrfin, not fvrfin.
                   The mnemonic for Vector Round to Floating-Point Integer toward Zero should be vrfiz, not fvrfiz.
                   The mnemonic for Vector Round to Floating-Point Integer toward Positive Infinity should be vrfip, not fvrfip.
                   The mnemonic for Vector Round to Floating-Point Integer toward Minus Infinity should be vrfim, not fvrfim.

6.2, page 6-24     Change the mfvscr encoding as shown below (note: bit 31 is not 0):

                   Field:  04   vD    0_0000  0000_0  1540
                   Bits:   0-5  6-10  11-15   16-20   21-31

6.2, page 6-25     Change the mtvscr encoding as shown below (note: bit 31 is not 0):

                   Field:  04   00_000  0_0000  vB     1604
                   Bits:   0-5  6-10    11-15   16-20  21-31

A.1, page A-2      Change the mfvscr encoding as shown below (note: bit 31 is not 0):
A.1, page A-2      Change the mtvscr encoding as shown below (note: bit 31 is not 0, and vD should be vB):

                   mfvscr  04  vD      0_0000  0000_0  1540
                   mtvscr  04  00_000  0_0000  vB      1604

A.2, page A-9      Change the mfvscr encoding as shown below (note: bit 31 is not 0):
A.2, page A-9      Change the mtvscr encoding as shown below (note: bit 31 is not 0):

                   mfvscr  000100  vD      0_0000  0000_0  110_0000_0100
                   mtvscr  000100  00_000  0_0000  vB      110_0100_0100

A.3, page A-14     Change the mfvscr encoding as shown below (note: bit 31 is not 0):
A.3, page A-14     Change the mtvscr encoding as shown below (note: bit 31 is not 0):

                   mfvscr  04  vD      0_0000  0000_0  1540
                   mtvscr  04  00_000  0_0000  vB      1604
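The corrected mfvscr and mtvscr encodings, and the VRSAVE layout from the replaced Figure 2-4, can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical helper names (`encode_mfvscr`, `encode_mtvscr`, `vrsave_mask`). It assumes bit 0 is the most-significant bit of the word, and it reads the correction as meaning that the extended opcode spans bits 21-31, so bit 31 belongs to XO rather than being a separate field.

```python
def encode_mfvscr(vd: int) -> int:
    """mfvscr: OPCD 04 (bits 0-5), vD (bits 6-10), zeros (bits 11-20), XO 1540 (bits 21-31)."""
    if not 0 <= vd <= 31:
        raise ValueError("vD must be 0-31")
    return (4 << 26) | (vd << 21) | 1540


def encode_mtvscr(vb: int) -> int:
    """mtvscr: OPCD 04 (bits 0-5), zeros (bits 6-15), vB (bits 16-20), XO 1604 (bits 21-31)."""
    if not 0 <= vb <= 31:
        raise ValueError("vB must be 0-31")
    return (4 << 26) | (vb << 11) | 1604


def vrsave_mask(registers) -> int:
    """Build a VRSAVE value with bit n set for each vector register VRn in use.

    Assumes bit n of VRSAVE (bit 0 = MSB) corresponds to VRn, as in the figure.
    """
    mask = 0
    for n in registers:
        if not 0 <= n <= 31:
            raise ValueError("vector register number must be 0-31")
        mask |= 1 << (31 - n)
    return mask


print(f"mfvscr v2   -> 0x{encode_mfvscr(2):08X}")
print(f"mtvscr v3   -> 0x{encode_mtvscr(3):08X}")
print(f"VRSAVE for VR0, VR1, VR31 -> 0x{vrsave_mask([0, 1, 31]):08X}")
```

Note that 1540 and 1604 are the decimal forms of the binary patterns 110_0000_0100 and 110_0100_0100 shown for A.2.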
Glossary of Terms and Abbreviations

The glossary contains an alphabetical list of terms, phrases, and abbreviations used in this book. Some of the terms and definitions included in the glossary are reprinted from IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic, copyright ©1985 by the Institute of Electrical and Electronics Engineers, Inc., with the permission of the IEEE.

A

Architecture. A detailed specification of requirements for a processor or computer system. It does not specify details of how the processor or computer system must be implemented; instead it provides a template for a family of compatible implementations.

Asynchronous exception. Exceptions that are caused by events external to the processor's execution. In this document, the term "asynchronous exception" is used interchangeably with the word interrupt.

Atomic access. A bus access that attempts to be part of a read-write operation to the same address uninterrupted by any other access to that address (the term refers to the fact that the transactions are indivisible). The PowerPC architecture implements atomic accesses through the lwarx/stwcx. instruction pair.

B

BAT (block address translation) mechanism. A software-controlled array that stores the available block address translations on-chip.

Beat. A single state on the 603e bus interface that may extend across multiple bus cycles. A 603e transaction can be composed of multiple address or data beats.

Biased exponent. An exponent whose range of values is shifted by a constant (bias). Typically a bias is provided to allow a range of positive values to express a range that includes both positive and negative values.

Big-endian. A byte-ordering method in memory where the address n of a word corresponds to the most-significant byte. In an addressed memory word, the bytes are ordered (left to right) 0, 1, 2, 3, with 0 being the most-significant byte. See Little-endian.
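The big-endian entry can be made concrete with a short sketch. It uses Python's standard `struct` module to lay out one 32-bit word in both byte orders; the value `0x0A0B0C0D` is just an arbitrary example chosen so each byte is recognizable.

```python
import struct

word = 0x0A0B0C0D

# Big-endian: byte 0 (the lowest address) holds the most-significant byte.
big = struct.pack(">I", word)
# Little-endian: byte 0 holds the least-significant byte.
little = struct.pack("<I", word)

for offset in range(4):
    print(f"byte {offset}: big-endian 0x{big[offset]:02X}, little-endian 0x{little[offset]:02X}")
```

Reading the big-endian bytes left to right (offsets 0, 1, 2, 3) reproduces the word as it is normally written, which is the ordering the glossary entry describes.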
Block. An area of memory that ranges from 128 Kbyte to 256 Mbyte whose size, translation, and protection attributes are controlled by the BAT mechanism.

Boundedly undefined. A characteristic of certain operation results that are not rigidly prescribed by the PowerPC architecture. Boundedly-undefined results for a given operation may vary among implementations and between execution attempts in the same implementation. Although the architecture does not prescribe the exact behavior for when results are allowed to be boundedly undefined, the results of executing instructions in contexts where results are allowed to be boundedly undefined are constrained to ones that could have been achieved by executing an arbitrary sequence of defined instructions, in valid form, starting in the state the machine was in before attempting to execute the given instruction.

Branch folding. The replacement with target instructions of a branch instruction and any instructions along the not-taken path when a branch is either taken or predicted as taken.

Branch prediction. The process of guessing whether a branch will be taken. Such predictions can be correct or incorrect; the term "predicted" as it is used here does not imply that the prediction is correct (successful). The PowerPC architecture defines a means for static branch prediction as part of the instruction encoding.

Branch resolution. The determination of whether a branch is taken or not taken. A branch is said to be resolved when the processor can determine which instruction path to take. If the branch is resolved as predicted, the instructions following the predicted branch that may have been speculatively executed can complete. If the branch is not resolved as predicted, instructions on the mispredicted path, and any results of speculative execution, are purged from the pipeline and fetching continues from the nonpredicted path.

Burst. A multiple-beat data transfer whose total size is typically equal to a cache block.

Bus clock. Clock that causes the bus state transitions.

Bus master. The owner of the address or data bus; the device that initiates or requests the transaction.
C

Cache. High-speed memory containing recently accessed data or instructions (subset of main memory).

Cache block. A small region of contiguous memory that is copied from memory into a cache. The size of a cache block may vary among processors; the maximum block size is one page. In PowerPC processors, cache coherency is maintained on a cache-block basis. Note that the term "cache block" is often used interchangeably with "cache line."

Cache coherency. An attribute wherein an accurate and common view of memory is provided to all devices that share the same memory system. Caches are coherent if a processor performing a read from its cache is supplied with data corresponding to the most recent value written to memory or to another processor's cache.

Cache flush. An operation that removes from a cache any data from a specified address range. This operation ensures that any modified data within the specified address range is written back to main memory. This operation is generated typically by a Data Cache Block Flush (dcbf) instruction.

Caching-inhibited. A memory update policy in which the cache is bypassed and the load or store is performed to or from main memory.

Cast out. A cache block that must be written to memory when a cache miss causes a cache block to be replaced.

Changed bit. One of two page history bits found in each page table entry (PTE). The processor sets the changed bit if any store is performed into the page. See also Page access history bits and Referenced bit.

Clean. An operation that causes a cache block to be written to memory, if modified, and then left in a valid, unmodified state in the cache.

Clear. To cause a bit or bit field to register a value of zero. See also Set.

Context synchronization. An operation that ensures that all instructions in execution complete past the point where they can produce an exception, that all instructions in execution complete in the context in which they began execution, and that all subsequent instructions are fetched and executed in the new context. Context synchronization may result from executing specific instructions (such as isync or rfi) or when certain events occur (such as an exception).
Copy-back operation. A cache operation in which a cache line is copied back to memory to enforce cache coherency. Copy-back operations consist of snoop push-out operations and cache cast-out operations.

D

Denormalized number. A nonzero floating-point number whose exponent has a reserved value, usually the format's minimum, and whose explicit or implicit leading significand bit is zero.

Direct-mapped cache. A cache in which each main memory address can appear in only one location within the cache; it operates more quickly when the memory request is a cache hit.

Direct-store segment access. An access to an I/O address space. The 603 defines separate memory-mapped and I/O address spaces, or segments, distinguished by the corresponding segment register T bit in the address translation logic of the 603. If the T bit is cleared, the memory reference is a normal memory-mapped access and can use the virtual memory management hardware of the 603. If the T bit is set, the memory reference is a direct-store access.

Double-word swap. AltiVec processors implement a double-word swap when moving quad words between vector registers and memory. The double-word swap performs an additional swap to keep vector registers and memory consistent in little-endian mode. Double-word swap is referred to as "swizzling" in the AltiVec technology architecture specification. This feature is not supported by the PowerPC architecture.

E

Effective address (EA). The 32-bit address specified for a load, store, or an instruction fetch. This address is then submitted to the MMU for translation to a physical memory address.

Exception. A condition encountered by the processor that requires special, supervisor-level processing.

Exception handler. A software routine that executes when an exception is taken. Normally, the exception handler corrects the condition that caused the exception, or performs some other meaningful task (that may include aborting the program that caused the exception). The address for each exception handler is identified by an exception vector offset defined by the architecture and a prefix selected via the MSR.
motorola glossary of terms and abbreviations glossary-5 extended opcode . a secondary opcode ?ld generally located in instruction bits 21?0, that further de?es the instruction type. all powerpc instructions are one word in length. the most signi?ant 6 bits of the instruction are the primary opcode , identifying the type of instruction. see also primary opcode. exclusive state. mei state (e) in which only one caching device contains data that is also in system memory. execution synchronization. a mechanism by which all instructions in execution are architecturally complete before beginning execution (appearing to begin execution) of the next instruction. similar to context synchronization but doesn't force the contents of the instruction buffers to be deleted and refetched. exponent. in the binary representation of a ?ating-point number, the exponent is the component that normally signi?s the integer power to which the value two is raised in determining the value of the represented number. see also biased exponent . f feed-forwarding. a 603e feature that reduces the number of clock cycles that an execution unit must wait to use a register. when the source register of the current instruction is the same as the destination register of the previous instruction, the result of the previous instruction is routed to the current instruction at the same time that it is written to the register ?e. with feed-forwarding, the destination bus is gated to the waiting execution unit over the appropriate source bus, saving the cycles which would be used for the write and read. fetch. retrieving instructions from either the cache or main memory and placing them into the instruction queue. floating-point register (fpr). any of the 32 registers in the ?ating-point register ?e. these registers provide the source operands and destination results for ?ating-point instructions. load instructions move data from memory to fprs and store instructions move data from fprs to memory. 
the fprs are 64 bits wide and store floating-point values in double-precision format.
floating-point unit. the functional unit in the 603e processor responsible for executing all floating-point instructions.
flush. an operation that causes a cache block to be invalidated and the data, if modified, to be written to memory.
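the exponent and floating-point register entries above refer to the 64-bit double-precision format. as a rough sketch, the fields of such a value can be pulled apart like this. the field widths (1 sign bit, 11 exponent bits, 52 fraction bits) come from ieee 754 rather than from this glossary, and the function name is hypothetical.

```python
import struct


def split_double(value: float):
    """Split a double into (sign, biased exponent, fraction) per IEEE 754."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    sign = bits >> 63
    biased_exponent = (bits >> 52) & 0x7FF   # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)        # 52-bit fraction field
    return sign, biased_exponent, fraction
```

for 1.0 the biased exponent is 1023 (the bias itself) and the fraction is zero, because the leading 1 of the significand is implied rather than stored.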
fraction. in the binary representation of a floating-point number, the field of the significand that lies to the right of its implied binary point.
fully associative. addressing scheme where every cache location (every byte) can have any possible address.
g
general-purpose register (gpr). any of the 32 registers in the general-purpose register file. these registers provide the source operands and destination results for all integer data manipulation instructions. integer load instructions move data from memory to gprs and store instructions move data from gprs to memory.
guarded. the guarded attribute pertains to out-of-order execution. when a page is designated as guarded, instructions and data cannot be accessed out-of-order.
h
harvard architecture. an architectural model featuring separate caches and other memory management resources for instructions and data.
hashing. an algorithm used in the page table search process.
i
ieee 754. a standard written by the institute of electrical and electronics engineers that defines operations and representations of binary floating-point numbers.
illegal instructions. a class of instructions that are not implemented for a particular powerpc processor. these include instructions not defined by the powerpc architecture. in addition, for 32-bit implementations, instructions that are defined only for 64-bit implementations are considered to be illegal instructions. for 64-bit implementations, instructions that are defined only for 32-bit implementations are considered to be illegal instructions.
implementation. a particular processor that conforms to the powerpc architecture, but may differ from other architecture-compliant implementations, for example in design, feature set, and implementation of optional features. the powerpc architecture has many different implementations.
implementation-dependent.
an aspect of a feature in a processor's design that is defined by a processor's design specifications rather than by the powerpc architecture.
implementation-specific. an aspect of a feature in a processor's design that is not required by the powerpc architecture, but for which the powerpc architecture may provide concessions to ensure that processors that implement the feature do so consistently.
imprecise exception. a type of synchronous exception that is allowed not to adhere to the precise exception model (see precise exception). the powerpc architecture allows only floating-point exceptions to be handled imprecisely.
inexact. loss of accuracy in an arithmetic operation when the rounded result differs from the infinitely precise value with unbounded range.
instruction queue. a holding place for instructions fetched from the current instruction stream.
integer unit. the functional unit in the 603e responsible for executing all integer instructions.
in-order. an aspect of an operation that adheres to a sequential model. an operation is said to be performed in-order if, at the time that it is performed, it is known to be required by the sequential execution model. see out-of-order.
instruction latency. the total number of clock cycles necessary to execute an instruction and make ready the results of that instruction.
instruction parallelism. a feature of powerpc processors that allows instructions to be processed in parallel.
interrupt. an external signal that causes the 603e to suspend current execution and take a predefined exception.
k
key bits. a set of key bits referred to as ks and kp in each segment register and each bat register. the key bits determine whether supervisor or user programs can access a page within that segment or block.
kill. an operation that causes a cache block to be invalidated without writing any modified data to memory.
l
latency. the number of clock cycles necessary to execute an instruction and make ready the results of that execution for a subsequent instruction.
l2 cache. see secondary cache.
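the key bits entry above says that ks and kp govern supervisor and user access respectively. one way to picture this is a selection between the two bits based on the current privilege level. the sketch below assumes a privilege flag named msr_pr (0 for supervisor, 1 for user); that name and the selection rule as written are assumptions for illustration, and the subsequent combination with page protection bits is deliberately left out.

```python
def effective_key(ks: int, kp: int, msr_pr: int) -> int:
    """Select the key that applies to the current access.

    Assumption: msr_pr == 0 means supervisor mode (use Ks),
    msr_pr == 1 means user mode (use Kp).
    """
    return kp if msr_pr else ks
```

with ks = 0 and kp = 1, a supervisor access sees key 0 and a user access sees key 1, so the same segment or block can be given different access rights at the two privilege levels.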
least-significant bit (lsb). the bit of least value in an address, register, field, data element, or instruction encoding.
least-significant byte (lsb). the byte of least value in an address, register, data element, or instruction encoding.
little-endian. a byte-ordering method in memory where the address n of a word corresponds to the least-significant byte. in an addressed memory word, the bytes are ordered (left to right) 3, 2, 1, 0, with 3 being the most-significant byte. see big-endian.
loop unrolling. loop unrolling provides a way of increasing performance by allowing more instructions to be issued in a clock cycle. the compiler replicates the loop body to increase the number of instructions executed between loop branches.
m
mantissa. the decimal part of a logarithm.
mei (modified/exclusive/invalid). cache coherency protocol used to manage caches on different devices that share a memory system. note that the powerpc architecture does not specify the implementation of a mei protocol to ensure cache coherency.
mesi (modified/exclusive/shared/invalid). cache coherency protocol used to manage caches on different devices that share a memory system. note that the powerpc architecture does not specify the implementation of a mesi protocol to ensure cache coherency.
memory access ordering. the specific order in which the processor performs load and store memory accesses and the order in which those accesses complete.
memory-mapped accesses. accesses whose addresses use the page or block address translation mechanisms provided by the mmu and that occur externally with the bus protocol defined for memory.
memory coherency. an aspect of caching in which it is ensured that an accurate view of memory is provided to all devices that share system memory.
memory consistency.
refers to agreement of levels of memory with respect to a single processor and system memory (for example, on-chip cache, secondary cache, and system memory).
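the little-endian entry in this part of the glossary can be illustrated by laying out one 32-bit word both ways. in this sketch each byte of the example word is given the value of its own significance (byte 3 is the most significant, byte 0 the least), so the resulting lists show directly which byte lands at address n, n+1, n+2, and n+3. the helper name and the example word are assumptions for illustration.

```python
def word_bytes(value: int, byteorder: str) -> list:
    """Return the 4 bytes of a 32-bit word in increasing address order."""
    return list(value.to_bytes(4, byteorder))


# Byte 3 (most significant) holds 0x03, byte 0 (least significant) holds 0x00.
WORD = 0x03020100

little = word_bytes(WORD, "little")  # address n holds the least-significant byte
big = word_bytes(WORD, "big")        # address n holds the most-significant byte
```

here little is [0, 1, 2, 3] and big is [3, 2, 1, 0]: the same word, with the byte at the lowest address swapped between the least- and most-significant end.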
memory management unit (mmu). the functional unit that is capable of translating an effective (logical) address to a physical address, providing protection mechanisms, and defining caching methods.
microarchitecture. the hardware details of a microprocessor's design. such details are not defined by the powerpc architecture.
mnemonic. the abbreviated name of an instruction used for coding.
modified state. mei state (m) in which one, and only one, caching device has the valid data for that address. the data at this address in external memory is not valid.
most-significant bit (msb). the highest-order bit in an address, register, data element, or instruction encoding.
most-significant byte (msb). the highest-order byte in an address, register, data element, or instruction encoding.
munging. a modification performed on an effective address that allows it to appear to the processor that individual aligned scalars are stored as little-endian values, when in fact they are stored in big-endian order, but at different byte addresses within double words. note that munging affects only the effective address and not the byte order. note also that this term is not used by the powerpc architecture.
multiprocessing. the capability of software, especially operating systems, to support execution on more than one processor at the same time.
n
nan. an abbreviation for not a number; a symbolic entity encoded in floating-point format. there are two types of nans: signaling nans and quiet nans.
no-op. no-operation. a single-cycle operation that does not affect registers or generate bus activity.
normalization. a process by which a floating-point value is manipulated such that it can be represented in the format for the appropriate precision (single- or double-precision). for a floating-point value to be representable in the single- or double-precision format, the leading implied bit must be a 1.
o
oea (operating environment architecture).
the level of the architecture that describes powerpc memory management model, supervisor-level registers, synchronization requirements, and the
exception model. it also defines the time-base feature from a supervisor-level perspective. implementations that conform to the powerpc oea also conform to the powerpc uisa and vea.
optional. a feature, such as an instruction, a register, or an exception, that is defined by the powerpc architecture but not required to be implemented.
out-of-order. an aspect of an operation that allows it to be performed ahead of one that may have preceded it in the sequential model, for example, speculative operations. an operation is said to be performed out-of-order if, at the time that it is performed, it is not known to be required by the sequential execution model. see in-order.
out-of-order execution. a technique that allows instructions to be issued and completed in an order that differs from their sequence in the instruction stream.
overflow. a condition that occurs during arithmetic operations when the result cannot be stored accurately in the destination register(s). for example, if two 32-bit numbers are multiplied, the result may not be representable in 32 bits. since the 32-bit registers of the 603e cannot represent this product, an overflow condition occurs.
p
page. a region in memory. the oea defines a page as a 4-kbyte area of memory, aligned on a 4-kbyte boundary.
page access history bits. the changed and referenced bits in the pte keep track of the access history within the page. the referenced bit is set by the mmu whenever the page is accessed for a read or write operation. the changed bit is set when the page is stored into. see changed bit and referenced bit.
page fault. a page fault is a condition that occurs when the processor attempts to access a memory location that resides within a page not currently resident in physical memory. on powerpc processors, a page fault exception condition occurs when a matching, valid page table entry (pte[v] = 1) cannot be located.
page table.
a table in memory that is comprised of page table entries, or ptes. it is further organized into eight ptes per pteg (page table entry group). the number of ptegs in the page table depends on the size of the page table (as specified in the sdr1 register).
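the page and page table entries above give concrete sizes: a 4-kbyte page, and eight ptes per pteg. a minimal sketch of the arithmetic these sizes imply follows. it assumes the 8-byte pte size that this glossary gives for 32-bit processors, treats the address as a flat page number plus offset (ignoring the segment lookup that real translation performs), and uses hypothetical helper names.

```python
PAGE_SIZE = 4096       # 4-Kbyte page, aligned on a 4-Kbyte boundary
PTES_PER_PTEG = 8      # eight PTEs per PTEG
PTE_BYTES_32BIT = 8    # PTE size on a 32-bit processor (assumed default here)


def split_address(address: int):
    """Split an address into (page number, byte offset within the page)."""
    return address // PAGE_SIZE, address % PAGE_SIZE


def pteg_count(table_bytes: int, pte_bytes: int = PTE_BYTES_32BIT) -> int:
    """Number of PTEGs that fit in a page table of the given size."""
    return table_bytes // (PTES_PER_PTEG * pte_bytes)
```

for example, a 64-kbyte table of 8-byte ptes holds 1024 ptegs, and the address 0x12345678 falls at offset 0x678 within page number 0x12345.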
page table entry (pte). data structures containing information used to translate effective address to physical address on a 4-kbyte page basis. a pte consists of 8 bytes of information in a 32-bit processor and 16 bytes of information in a 64-bit processor.
park. the act of allowing a bus master to maintain bus mastership without having to arbitrate.
persistent data stream. a data stream is considered to be persistent when it is expected to be loaded from frequently.
physical memory. the actual memory that can be accessed through the system's memory bus.
pipelining. a technique that breaks operations, such as instruction processing or bus transactions, into smaller distinct stages or tenures (respectively) so that a subsequent operation can begin before the previous one has completed.
precise exceptions. a category of exception for which the pipeline can be stopped so instructions that preceded the faulting instruction can complete and subsequent instructions can be flushed and redispatched after exception handling has completed. see imprecise exceptions.
primary opcode. the most-significant 6 bits (bits 0-5) of the instruction encoding that identifies the type of instruction.
program order. the order of instructions in an executing program. more specifically, this term is used to refer to the original order in which program instructions are fetched into the instruction queue from the cache.
protection boundary. a boundary between protection domains.
protection domain. a protection domain is a segment, a virtual page, a bat area, or a range of unmapped effective addresses. it is defined only when the appropriate relocate bit in the msr (ir or dr) is 1.
q
quad word. a group of 16 contiguous locations starting at an address divisible by 16.
quiesce. to come to rest. the processor is said to quiesce when an exception is taken or a sync instruction is executed.
the instruction stream is stopped at the decode stage and executing instructions are allowed to complete to create a controlled context for instructions that may be
affected by out-of-order, parallel execution. see context synchronization.
quiet nan. a type of nan that can propagate through most arithmetic operations without signaling exceptions. a quiet nan is used to represent the results of certain invalid operations, such as invalid arithmetic operations on infinities or on nans, when invalid. see signaling nan.
r
ra. the ra instruction field is used to specify a gpr to be used as a source or destination.
rb. the rb instruction field is used to specify a gpr to be used as a source.
rd. the rd instruction field is used to specify a gpr to be used as a destination.
rs. the rs instruction field is used to specify a gpr to be used as a source.
real address mode. an mmu mode when no address translation is performed and the effective address specified is the same as the physical address. the processor's mmu is operating in real address mode if its ability to perform address translation has been disabled through the msr register's ir and/or dr bits.
record bit. bit 31 (or the rc bit) in the instruction encoding. when it is set, updates the condition register (cr) to reflect the result of the operation.
referenced bit. one of two page history bits found in each page table entry (pte). the processor sets the referenced bit whenever the page is accessed for a read or write. see also page access history bits.
register indirect addressing. a form of addressing that specifies one gpr that contains the address for the load or store.
register indirect with immediate index addressing. a form of addressing that specifies an immediate value to be added to the contents of a specified gpr to form the target address for the load or store.
register indirect with index addressing. a form of addressing that specifies that the contents of two gprs be added together to yield the target address for the load or store.
rename register.
temporary buffers used by instructions that have finished execution but have not completed.
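the instruction fields named in this part of the glossary (primary opcode in bits 0-5, the record bit in bit 31, and the ra, rb, and rd register fields) can be extracted from a 32-bit instruction word with simple shifts and masks. the sketch below assumes powerpc bit numbering, where bit 0 is the most significant bit, and assumes the common layout in which rd occupies bits 6-10, ra bits 11-15, rb bits 16-20, and the extended opcode bits 21-30. those register field positions are not stated in this glossary, and the helper names are hypothetical.

```python
def field(word: int, first_bit: int, last_bit: int) -> int:
    """Extract bits first_bit..last_bit of a 32-bit word (bit 0 = most significant)."""
    width = last_bit - first_bit + 1
    return (word >> (31 - last_bit)) & ((1 << width) - 1)


def decode_fields(word: int) -> dict:
    """Decode one instruction word under the assumed field layout."""
    return {
        "primary_opcode": field(word, 0, 5),
        "rd": field(word, 6, 10),             # assumed position
        "ra": field(word, 11, 15),            # assumed position
        "rb": field(word, 16, 20),            # assumed position
        "extended_opcode": field(word, 21, 30),
        "rc": field(word, 31, 31),            # record bit
    }


# Example word built from fields: opcode 31, rd=3, ra=4, rb=5, extended opcode 266, rc=0.
EXAMPLE_WORD = (31 << 26) | (3 << 21) | (4 << 16) | (5 << 11) | (266 << 1) | 0
```

decoding EXAMPLE_WORD (0x7c642a14) returns the same field values it was built from, which is a quick check that the bit numbering is applied consistently.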
reservation. the processor establishes a reservation on a cache block of memory space when it executes an lwarx instruction to read a memory semaphore into a gpr.
reservation station. a buffer between the dispatch and execute stages that allows instructions to be dispatched even though the results of instructions on which the dispatched instruction may depend are not available.
risc (reduced instruction set computing). an architecture characterized by fixed-length instructions with nonoverlapping functionality and by a separate set of load and store instructions that perform memory accesses.
s
scan interface. the 603e test interface.
secondary cache. a cache memory that is typically larger and has a longer access time than the primary cache. a secondary cache may be shared by multiple devices. also referred to as l2, or level-2, cache.
set (v). to write a nonzero value to a bit or bit field; the opposite of clear. the term "set" may also be used to generally describe the updating of a bit or bit field.
set (n). a subdivision of a cache. cacheable data can be stored in a given location in one of the sets, typically corresponding to its lower-order address bits. because several memory locations can map to the same location, cached data is typically placed in the set whose cache block corresponding to that address was used least recently. see set-associative.
set-associative. aspect of cache organization in which the cache space is divided into sections, called sets. the cache controller associates a particular main memory address with the contents of a particular set, or region, within the cache.
shadowing. shadowing allows a register to be updated by instructions that are executed out of order without destroying machine state information.
signaling nan. a type of nan that generates an invalid operation program exception when it is specified as an arithmetic operand. see quiet nan.
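the quiet nan and signaling nan entries distinguish two kinds of nan but do not say how they are encoded. a minimal sketch for single-precision bit patterns follows. it assumes the widely used convention in which a nan has an all-ones exponent and a nonzero fraction, and the most-significant fraction bit is 1 for a quiet nan and 0 for a signaling nan. the function name is hypothetical.

```python
def classify_nan(bits: int) -> str:
    """Classify a 32-bit single-precision pattern as quiet, signaling, or not a NaN."""
    exponent = (bits >> 23) & 0xFF   # 8-bit exponent field
    fraction = bits & 0x7FFFFF       # 23-bit fraction field
    if exponent != 0xFF or fraction == 0:
        return "not a nan"           # finite values and infinities
    # Assumed convention: top fraction bit set means quiet, clear means signaling.
    return "quiet" if fraction & 0x400000 else "signaling"
```

under this convention 0x7fc00000 is a quiet nan, 0x7f800001 is a signaling nan, and 0x7f800000 is infinity rather than a nan.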
significand. the component of a binary floating-point number that consists of an explicit or implicit leading bit to the left of its implied binary point and a fraction field to the right.
simd. single instruction stream, multiple data streams. a vector instruction can operate on several data elements within a single instruction in a single functional unit. simd is a way to work with all the data at once (in parallel), which can make execution faster.
simplified mnemonics. assembler mnemonics that represent a more complex form of a common operation.
slave. the device addressed by a master device. the slave is identified in the address tenure and is responsible for supplying or latching the requested data for the master during the data tenure.
snooping. monitoring addresses driven by a bus master to detect the need for coherency actions.
snoop push. response to a snooped transaction that hits a modified cache block. the cache block is written to memory and made available to the snooping device.
splat. a splat instruction takes one element and replicates (splats) that value across a vector register. the purpose is to give all elements the same value so that the register can be used as a constant, for example to multiply other vector registers.
split-transaction. a transaction with independent request and response tenures.
split-transaction bus. a bus that allows address and data transactions from different processors to occur independently.
stage. the term "stage" is used in two different senses, depending on whether the pipeline is being discussed as a physical entity or a sequence of events. in the latter case, a stage is an element in the pipeline during which certain actions are performed, such as decoding the instruction, performing an arithmetic operation, or writing back the results. typically, the latency of a stage is one processor clock cycle.
some events, such as dispatch, write-back, and completion, happen instantaneously and may be thought to occur at the end of a stage. an instruction can spend multiple cycles in one stage. an integer multiply, for example, takes multiple cycles in the execute stage. when this occurs, subsequent instructions may stall. an instruction may also occupy more than one stage
simultaneously, especially in the sense that a stage can be seen as a physical resource; for example, when instructions are dispatched they are assigned a place in the cq at the same time they are passed to the execute stage. they can be said to occupy both the complete and execute stages in the same clock cycle.
stall. an occurrence when an instruction cannot proceed to the next stage.
static branch prediction. mechanism by which software (for example, compilers) can hint to the machine hardware about the direction a branch is likely to take.
sticky bit. a bit that when set must be cleared explicitly.
superscalar machine. a machine that can issue multiple instructions concurrently from a conventional linear instruction stream.
supervisor mode. the privileged operation state of a processor. in supervisor mode, software, typically the operating system, can access all control registers and can access the supervisor memory space, among other privileged operations.
synchronization. a process to ensure that operations occur strictly in order. see context synchronization and execution synchronization.
synchronous exception. an exception that is generated by the execution of a particular instruction or instruction sequence. there are two types of synchronous exceptions, precise and imprecise.
system memory. the physical memory available to a processor.
t
tenure. the period of bus mastership. for the 603e, there can be separate address bus tenures and data bus tenures. a tenure consists of three phases: arbitration, transfer, and termination.
tlb (translation lookaside buffer). a cache that holds recently-used page table entries.
throughput. the measure of the number of instructions that are processed per clock cycle.
tiny. a floating-point value that is too small to be represented for a particular precision format, including denormalized numbers; they do not include ±0.
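the tiny entry above can be made concrete for the single-precision format. the sketch below treats a value as tiny when it is nonzero and smaller in magnitude than the smallest normalized single-precision number, 2 to the power -126. that threshold comes from ieee 754 rather than from this glossary, the check is applied to an already-computed value rather than to an unrounded intermediate result, and the names are hypothetical.

```python
# Smallest positive normalized single-precision value (IEEE 754), an assumption here.
SINGLE_MIN_NORMAL = 2.0 ** -126


def is_tiny_single(value: float) -> bool:
    """True for nonzero values below the single-precision normalized range."""
    # Zero (of either sign) is excluded, matching the glossary's "do not include ±0".
    return value != 0.0 and abs(value) < SINGLE_MIN_NORMAL
```

so 2 to the power -130 counts as tiny for single precision, while 0.0, -0.0, and 2 to the power -126 itself do not.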
transaction. a complete exchange between two bus devices. a transaction is typically comprised of an address tenure and one or more data tenures, which may overlap or occur separately from the address tenure. a transaction may be minimally comprised of an address tenure only.
transfer termination. signal that refers to both signals that acknowledge the transfer of individual beats (of both single-beat transfer and individual beats of a burst transfer) and to signals that mark the end of the tenure.
transient stream. a data stream is considered to be transient when it is likely to be referenced from infrequently.
u
uisa (user instruction set architecture). the level of the architecture to which user-level software should conform. the uisa defines the base user-level instruction set, user-level registers, data types, floating-point memory conventions and exception model as seen by user programs, and the memory and programming models.
underflow. a condition that occurs during arithmetic operations when the result cannot be represented accurately in the destination register. for example, underflow can happen if two floating-point fractions are multiplied and the result requires a smaller exponent and/or mantissa than the single-precision format can provide. in other words, the result is too small to be represented accurately.
user mode. the operating state of a processor used typically by application software. in user mode, software can access only certain control registers and can access only user memory space. no privileged operations can be performed. also referred to as problem state.
v
va. the va instruction field is used to specify a vector register to be used as a source or destination.
vb. the vb instruction field is used to specify a vector register to be used as a source.
vc. the vc instruction field is used to specify a vector register to be used as a source.
vd.
the vd instruction field is used to specify a vector register to be used as a destination.
vs. the vs instruction field is used to specify a vector register to be used as a source.
vea (virtual environment architecture). the level of the architecture that describes the memory model for an environment in which multiple devices can access memory, defines aspects of the cache model, defines cache control instructions, and defines the time-base facility from a user-level perspective. implementations that conform to the powerpc vea also adhere to the uisa, but may not necessarily adhere to the oea.
vector. the spatial parallel processing of short, fixed-length one-dimensional matrices performed by an execution unit.
vector register (vr). any of the 32 registers in the vector register file. each vector register is 128 bits wide. these registers can provide the source operands and destination results for altivec instructions.
virtual address. an intermediate address used in the translation of an effective address to a physical address.
virtual memory. the address space created using the memory management facilities of the processor. program access to virtual memory is possible only when it coincides with physical memory.
w
way. a location in the cache that holds a cache block, its tags and status bits.
weak ordering. a memory access model that allows bus operations to be reordered dynamically, which improves overall performance and in particular reduces the effect of memory latency on instruction throughput.
word. a 32-bit data element.
write-back. a cache memory update policy in which processor write cycles are directly written only to the cache. external memory is updated only indirectly, for example, when a modified cache block is cast out to make room for newer data.
write-through. a cache memory update policy in which all processor write cycles are written to both the cache and memory.
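the splat and vector register entries in this glossary can be tied together with a small model. the sketch below represents one 128-bit vector register as a list of four 32-bit words, replicates one chosen element across all four positions, and then uses the result as a constant multiplier. it is a plain software model with hypothetical function names, not a description of any particular altivec instruction, and wrapping each product to 32 bits is a modeling choice made here.

```python
def splat_word(vector: list, index: int) -> list:
    """Replicate element `index` of the vector into every element position."""
    return [vector[index]] * len(vector)


def multiply_elements(a: list, b: list) -> list:
    """Multiply two vectors element by element, keeping each result in 32 bits."""
    return [(x * y) & 0xFFFFFFFF for x, y in zip(a, b)]


source = [10, 20, 30, 40]          # four 32-bit words = one 128-bit register
constant = splat_word(source, 2)   # every element becomes source[2]
scaled = multiply_elements([1, 2, 3, 4], constant)
```

after the splat, constant is [30, 30, 30, 30], so the multiply scales every element of the other vector by the same value in one element-wise step.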
index
a
acronyms and abbreviated terms, list, xxiv address bus address calculation, 4-26 address modes, 1-9 address translation for streams, 5-7 alignment aligned scalars, le mode, 3-4 effective address, 4-26 load and store, 4-26 load instruction support, 4-29 memory access and vector register, 3-6 misaligned accesses, 3-1 misaligned vectors, 3-7 partially executed instructions, 5-10 quad-word data alignment, 3-7 rules, 3-4 altivec technology address modes, 1-9 cache overview, 1-12 exception handling, 1-12 features list, 1-4 features not defined, 1-6 instruction set, 1-11, 6-9, a-1 to f-6 instruction set architecture support, 1-5 interelement operations, 1-9 intraelement operations, 1-9 levels of the powerpc architecture, 1-5 operations supported, 1-9 overview, 1-3 powerpc architecture extension, 1-2 programming model, 1-6 register file structure, 2-4 register set, 1-6, 2-4, 2-8 simd-style extension, 1-3, 1-7 structural overview, 1-4 arithmetic instructions floating-point, 4-19 integer, 4-1
b
big-endian mode accessing a misaligned quad word, 3-8 byte ordering, 1-7, 3-3 concept, 3-3 mapping, quad word, 3-3 misaligned vector, 3-7 mixed-endian systems, 3-12 block count, 5-2 block size, 5-2 block stride, 5-2 byte ordering aligned scalars, le mode, 3-4 big-endian mode, default, 3-3 concept, 3-2 default, 1-7 le bit in msr, 3-3 least-significant byte (lsb), 3-3 little-endian mode description, 3-3 most-significant byte (msb), 3-3 quad-word example, 3-3
c
cache cache management instructions, 4-42 data stream touch, 5-2 dss instruction, 5-5 dst instruction, 5-2 dstst instruction, 5-4 dstt instruction, 5-4 overview, 1-12, 5-1 prefetch, software-directed, 5-2 prioritizing cache block replacement, 5-9 stopping streams, 5-5 storing to streams, 5-4 transient streams, 5-4 cache management instructions, 4-42 classes of instructions, 4-2 compare instructions floating-point, 4-22 integer, 4-13, 4-14 computation modes powerpc architecture support, 4-2 conventions, xxiii
classes of instructions, 4-2 computation modes, 4-2
execution model, 4-2 memory addressing, 4-3 operand conventions, 3-1 terminology, xxvii cr (condition register) bit fields, 2-8 cr6 field, compare instructions, 2-8 move to/from cr instructions, 4-40
d
data organization, memory, 3-1 data stream, 5-2 double-word swap, 3-6
e
echo cancellation, 1-2 effective address calculation ea modifications, 3-5 loads and stores, 4-26 overview, 4-3 estimate instructions, 4-24 exceptions data address breakpoint, 5-10 dsi exception, 5-10 exception behavior of prefetch streams, 5-6 exception handling, 1-12 floating-point exceptions, 3-14 invalid operation exception, 3-16 log of zero exception, 3-16 nan operand exception, 3-15 overflow exception, 3-17 overview, 5-1 precise exceptions, 5-12 priorities, 5-12 synchronous exceptions, 5-12 unavailable exception, 5-10 underflow exception, 3-17 zero divide exception, 3-16 exclusive or (xor), 3-4 execution model conventions, 4-2 floating-point, 3-12 extended mnemonics, see simplified mnemonics
f
features list altivec technology features, 1-4 features not defined, 1-6 floating-point model arithmetic instructions, 4-19 compare instructions, 4-22 division function, 4-18 estimate instructions, 4-24 exceptions, 3-14 execution model, 3-12 infinities, 3-14 instructions, overview, 4-17 java mode, 3-13 modes, 3-13 multiply-add instructions, 4-20 nans, 3-17 non-java mode, 3-14 rounding mode, 3-14 rounding/conversion instructions, 4-21 square root functions, 4-19 formatting instructions, 4-31
h
high-order byte numbering, 1-8
i
instructions cache management instructions, 4-42 classes of instructions, 4-2 computation modes, 4-2 control flow, 4-31 conventions, xxvii, 6-2 detailed descriptions, 6-9 to 6-177 floating-point arithmetic, 4-19 compare, 4-22 computational instructions, 3-12 division function, 4-18 estimate instructions, 4-24 multiply-add, 4-20 noncomputational instructions, 3-12 overview, 4-17 rounding/conversion, 4-21 square root functions, 4-19 format,
lists, e-1 formats, 6-1 formatting instructions, 4-31 general information, f-1, g-1 integer arithmetic, 4-1, 4-4 compare, 4-13, 4-14 load, 4-27 logical, 4-1, 4-15 rotate/shift, 4-16 store, 4-30 listed by format, e-1 listed by mnemonic, 6-9-6-177, a-1 listed by opcode, c-1, d-1 load and store address generation, integer, 4-26 f r e e s c a l e s e m i c o n d u c t o r , i freescale semiconductor, inc. f o r m o r e i n f o r m a t i o n o n t h i s p r o d u c t , g o t o : w w w . f r e e s c a l e . c o m n c . . .
motorola index index-3 integer load, 4-27 integer store, 4-30 memory addressing, 4-3 memory control instructions, 4-41 merge instructions, 4-34 mnemonics, lists, a-1 notations, 6-2 opcodes, lists, c-1, d-1 overview, 1-11 pack instructions, 4-31 partially executed instructions, 5-10 permutation instructions, 4-31 permute instructions, 4-36 powerpc instructions, list, a-1, b-1 processor control instructions, 4-39 quick reference, f-1, g-1 select instruction, 4-36 shift instructions, 4-37 splat instructions, 4-35 syntax conventions, xxvii, 6-2 unpack instructions, 4-33 vector integer, see integer integer instructions arithmetic instructions, 4-1, 4-4 compare instructions, 4-13, 4-14 logical instructions, 4-1, 4-15 rotate/shift instructions, 4-16 store instructions, 4-30 integer load instructions, 4-27 interelement operations, 1-9 intraelement operations, 1-9 invalid operation exception, 3-16 j java mode, 3-13 l little-endian mode accessing a misaligned quad word, 3-10 byte ordering, 3-3 description, 3-3 mapping, quad word, 3-4 misaligned vector, 3-7 mixed-endian systems, 3-12 swapping, 3-6 load/store address generation, integer, 4-26 integer load instructions, 4-27 integer store instructions, 4-30 log of zero exception, 3-16 logical instructions, integer, 4-1, 4-15 low-order byte numbering, 1-8 m mathematical predicates, 4-23 memory addressing, 4-3 memory control instructions, 4-41 memory management unit (mmu) memory bandwidth, 5-1 overview, 1-12, 5-1 prefetch data stream touch, 5-2 dss instruction, 5-5 dst instruction, 5-2 dstst instruction, 5-4 dstt instruction, 5-4 exception behavior, 5-6 software-directed, 5-2 stopping streams, 5-5 storing to streams, 5-4 transient streams, 5-4 memory operands, 4-3 memory sharing, 5-1 memory, data organization, 3-1 merge instructions, 4-34 misalignment accessing a quad word big-endian mode, 3-8 little-endian mode, 3-10 misaligned accesses, 3-1 misaligned vectors, 3-7 mixed-endian systems, 3-12 modulo mode, 4-4 move to/from cr 
instructions, 4-40 msr (machine state register) bit settings, 2-9 le bit, 3-3 multiply-add instructions, 4-20 munging, description, 3-4 n nan (not a number) conversion to integer, 3-18 ?ating-point nans, 3-17 operand exception, 3-15 precedence, 3-18 production, 3-18 non-java mode, 3-14 o oea (operating environment architecture) de?ition, xx programming model, 2-2 operands conventions, description, 1-7, 3-1 f r e e s c a l e s e m i c o n d u c t o r , i freescale semiconductor, inc. f o r m o r e i n f o r m a t i o n o n t h i s p r o d u c t , g o t o : w w w . f r e e s c a l e . c o m n c . . .
index-4 altivec technology programming environments manual motorola ?ating-point conventions, 1-8 memory operands, 4-3 operating environment architecture, see oea operations interelement operations, 1-9 intraelement operations, 1-9 over?w exception, 3-17 p pack instructions, 4-31 permutation instructions, 4-31 permute instructions, 4-36 powerpc architecture support computation modes, 4-2 execution model, 4-2 features summary de?ed features, 1-4 features not de?ed, 1-6 instruction list, a-1, b-1 levels of the powerpc architecture, 1-5 operating environment architecture, xx programming model, 1-6 registers affected by altivec technology, 2-8 user instruction set architecture, xix, 1-5 virtual environment architecture, xix, 1-5 prefetch, software-directed, 5-2 processor control instructions, 4-39 q qnan arithmetic, 3-18 r record bit (rc), 6-2 registers cr, 2-8 overview, 1-6, 2-1 powerpc register set, 2-1, 2-8 register ?e, 2-4 srr0/srr1, 2-10 vrs, 2-4 vrsave, 2-6 vscr, 2-4 rotate instructions, 4-16 rounding/conversion instructions, fp, 4-21 s saturation detection, 4-4 scalars aligned, le mode, 3-4 loads and stores, 3-11 misaligned loads and stores, 3-11 segment registers t bit, glossary-4 select instruction, 4-36 shift instructions, 4-16, 4-37 simd-style extension, 1-3, 1-7 simpli?d mnemonics, 4-40 snan arithmetic, 3-18 splat instructions, 4-35 srr0/srr1 (status save/restore registers), 2-10 streams address translation, 5-7 de?ition, 5-3 implementation assumptions, 5-9 synchronization, 5-7 usage notes, 5-7 stride, 5-2 swizzle, see double-word swap synchronization streams, 5-7 t terminology conventions, xxvii transient streams, 5-4 u uisa (user instruction set architecture), xix, 1-5 programming model, 2-2 under?w exception, 3-17 unpack instructions, 4-33 user instruction set architecture, see uisa v vea (virtual environment architecture) de?ition, xix, 1-5 programming model, 2-2 user-level cache control instructions, 4-41 vector formatting instructions, 4-31 vector 
integer compare instructions, see integer compare instructions vector merge instructions, 4-34 vector pack instructions, 4-31 vector permutation instructions, 4-31 vector permute instructions, 4-36 vector select instruction, 4-36 vector shift instructions, 4-37 vector splat instructions, 4-35 vector unpack instructions, 4-33 virtual environment architecture, see vea vrs (vector registers) memory access alignment and vr, 3-6 register ?e, 2-4 vrsave register, 2-6 vscr (vector status and control register), 2-4 f r e e s c a l e s e m i c o n d u c t o r , i freescale semiconductor, inc. f o r m o r e i n f o r m a t i o n o n t h i s p r o d u c t , g o t o : w w w . f r e e s c a l e . c o m n c . . .
motorola index index-5 x xor (exclusive or), 3-4 z zero divide exception, 3-16 f r e e s c a l e s e m i c o n d u c t o r , i freescale semiconductor, inc. f o r m o r e i n f o r m a t i o n o n t h i s p r o d u c t , g o t o : w w w . f r e e s c a l e . c o m n c . . .
1    overview
2    altivec register set
3    operand conventions
4    addressing modes and instruction set summary
5    cache, exceptions, and memory management
6    altivec instructions
a    appendix a: instruction set mnemonics - decimal
b    appendix b: instruction set mnemonics - binary
c    appendix c: opcodes - decimal
d    appendix d: opcodes - binary
e    appendix e: forms
f    appendix f: legends
g    appendix g: revision history
glo  glossary of terms and abbreviations
ind  index
